Cloudberry (9): Distributed Transactions and the INSERT Statement

Original article, last modified 2026-03-24 19:09:51

This content is licensed under the CC 4.0 BY-SA license.

First published 2026-03-20 11:23:31

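The PostgreSQL source tree contains a README file (src/backend/access/transam/README) that describes in detail how the various transaction operations relate to each other and which functions call which.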

For example, consider the following sequence of user commands:

1)              BEGIN
2)              SELECT * FROM foo
3)              INSERT INTO foo VALUES (...)
4)              COMMIT

In the main processing loop, this results in the following function call
sequence:

     /  StartTransactionCommand;
    /       StartTransaction;
1) <    ProcessUtility;                 << BEGIN
    \       BeginTransactionBlock;
     \  CommitTransactionCommand;

    /   StartTransactionCommand;
2) /    PortalRunSelect;                << SELECT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

    /   StartTransactionCommand;
3) /    ProcessQuery;                   << INSERT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

     /  StartTransactionCommand;
    /   ProcessUtility;                 << COMMIT
4) <        EndTransactionBlock;
    \   CommitTransactionCommand;
     \      CommitTransaction;

Every statement, including BEGIN and COMMIT, is wrapped in StartTransactionCommand and CommitTransactionCommand (or AbortCurrentTransaction on failure).

Because step 1 is a BEGIN command, BeginTransactionBlock is called, and StartTransaction is also called to mark the start of the transaction.

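BeginTransactionBlock is specific to the BEGIN command. It signals that the SQL statements that follow form one complete transaction, so it has to update some transaction state.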

StartTransaction is a low-level transaction call. It is invoked whether or not there is an explicit BEGIN.
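
The interplay between the per-statement wrappers and the transaction block can be sketched as a small state machine. This is a toy model written for this article, not PostgreSQL's code: the type ToyXact, the TOY_* states, and the toy_* functions are hypothetical names, and only the happy path from the README excerpt is modeled (no abort handling, no subtransactions).

```c
#include <assert.h>

/* Toy transaction-block states (hypothetical, loosely modeled on the README). */
typedef enum { TOY_DEFAULT, TOY_STARTED, TOY_BEGIN, TOY_INPROGRESS, TOY_END } ToyBlockState;

typedef struct {
    ToyBlockState state;
    int starts;   /* low-level "StartTransaction" calls */
    int commits;  /* low-level "CommitTransaction" calls */
    int cci;      /* "CommandCounterIncrement" calls */
} ToyXact;

/* Wrapper run before every statement: starts a transaction only if none is open. */
static void toy_start_command(ToyXact *x)
{
    if (x->state == TOY_DEFAULT) {
        x->starts++;
        x->state = TOY_STARTED;
    }
}

/* BEGIN: mark that the following statements belong to one block. */
static void toy_begin_block(ToyXact *x)
{
    assert(x->state == TOY_STARTED);
    x->state = TOY_BEGIN;
}

/* COMMIT: mark that the block should be committed by the closing wrapper. */
static void toy_end_block(ToyXact *x)
{
    assert(x->state == TOY_INPROGRESS);
    x->state = TOY_END;
}

/* Wrapper run after every statement. */
static void toy_commit_command(ToyXact *x)
{
    switch (x->state) {
    case TOY_STARTED:            /* single statement outside a block */
    case TOY_END:                /* COMMIT was just processed */
        x->commits++;
        x->state = TOY_DEFAULT;
        break;
    case TOY_BEGIN:              /* BEGIN was just processed */
        x->state = TOY_INPROGRESS;
        break;
    case TOY_INPROGRESS:         /* statement inside a block */
        x->cci++;
        break;
    case TOY_DEFAULT:
        break;
    }
}

/* Replay BEGIN; <nstatements statements>; COMMIT and return the counters. */
static ToyXact toy_run_block(int nstatements)
{
    ToyXact x = { TOY_DEFAULT, 0, 0, 0 };
    int i;

    toy_start_command(&x); toy_begin_block(&x); toy_commit_command(&x);
    for (i = 0; i < nstatements; i++) {
        toy_start_command(&x);
        /* the statement itself would run here */
        toy_commit_command(&x);
    }
    toy_start_command(&x); toy_end_block(&x); toy_commit_command(&x);
    return x;
}
```

Replaying the four-command example with toy_run_block(2) starts and commits the low-level transaction once each, while the two statements in between only increment the command counter.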

In the BEGIN step, ProcessUtility is where the actual logic runs, and it includes the call to BeginTransactionBlock. In the Greenplum code, BEGIN is executed on both the master and the segments.

In the INSERT step, ProcessQuery is where the actual logic runs. In the Cloudberry / Greenplum code, this is the step that sends the insert-related SQL to the corresponding segments.

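In the COMMIT step, CommitTransaction is where the actual logic runs. In the Cloudberry / Greenplum code it has two steps: first it sends DTX_PROTOCOL_COMMAND_PREPARE to every segment, then it sends DTX_PROTOCOL_COMMAND_COMMIT_PREPARED to every segment. This is the well-known two-phase commit.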
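
As a rough illustration of the two phases, here is a minimal in-memory sketch. It is not the Cloudberry / Greenplum dispatcher: ToySegment, ToySegState, toy_send_prepare, and toy_two_phase_commit are hypothetical names, the "messages" are plain function calls, and the abort branch is the generic two-phase commit rule rather than anything taken from the source code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SEG_ACTIVE, SEG_PREPARED, SEG_COMMITTED, SEG_ABORTED } ToySegState;

typedef struct {
    ToySegState state;
    bool fail_prepare;   /* simulate a segment that cannot prepare */
} ToySegment;

/* Phase 1 message: ask one segment to prepare; returns its vote. */
static bool toy_send_prepare(ToySegment *seg)
{
    if (seg->fail_prepare) {
        seg->state = SEG_ABORTED;
        return false;
    }
    seg->state = SEG_PREPARED;
    return true;
}

/*
 * Coordinator logic: commit only if every segment prepared successfully,
 * otherwise abort everywhere. Returns true when the transaction committed.
 */
static bool toy_two_phase_commit(ToySegment *segs, size_t n)
{
    bool all_prepared = true;
    size_t i;

    assert(segs != NULL || n == 0);

    /* Phase 1: PREPARE on every segment, stopping at the first failure. */
    for (i = 0; i < n; i++) {
        if (!toy_send_prepare(&segs[i])) {
            all_prepared = false;
            break;
        }
    }

    /* Phase 2: COMMIT PREPARED everywhere, or abort everywhere. */
    for (i = 0; i < n; i++)
        segs[i].state = all_prepared ? SEG_COMMITTED : SEG_ABORTED;

    return all_prepared;
}
```

The point of the split is that no segment makes its changes permanent until the coordinator has seen every segment succeed in phase 1.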

A simple begin command only uses libpq for communication. A complex query uses Greenplum's own Interconnect mechanism to exchange data.

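Connect to the master node with "psql -d gpadmin". The client first connects to the postgres main process, which then forks a child process; this is standard behavior in single-node PostgreSQL as well. Then run "begin;" from the client. The command is sent to the master node over the libpq protocol.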

case 'Q':			/* simple query */

On the master node, the newly forked process waits for commands in the ReadCommand function.

/* ----------------
 *              ReadCommand reads a command from either the frontend or
 *              standard input, places it in inBuf, and returns the
 *              message type code (first byte of the message).
 *              EOF is returned if end of file.
 * ----------------
 */
static int
ReadCommand(StringInfo inBuf)
{
        int                     result;

        SIMPLE_FAULT_INJECTOR(BeforeReadCommand);

        if (whereToSendOutput == DestRemote)
                result = SocketBackend(inBuf);
        else
                result = InteractiveBackend(inBuf);
        return result;
}

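ReadCommand reads its data through SocketBackend. Once the message has been read, the check "firstchar == 'Q'" matches, so this is a libpq simple query. After the query statement is parsed, it turns out to be a standalone begin command, so BeginTransactionBlock is executed locally.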
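
To make the 'Q' message concrete, the sketch below decodes a simple query message from a byte buffer. In the PostgreSQL wire protocol (version 3), a simple query is the type byte 'Q', a 4-byte big-endian length that counts itself but not the type byte, and then the NUL-terminated query text. The function name toy_parse_simple_query is hypothetical and this is not the backend's SocketBackend code, which reads from the socket into a StringInfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode one simple query message held entirely in buf.
 * On success, stores a pointer to the query text in *query and returns 0.
 * Returns -1 if the buffer is not a well-formed 'Q' message.
 */
static int toy_parse_simple_query(const unsigned char *buf, size_t buflen,
                                  const char **query)
{
    uint32_t len;

    assert(buf != NULL && query != NULL);

    /* Smallest valid message: type byte + 4-byte length + lone NUL. */
    if (buflen < 6 || buf[0] != 'Q')
        return -1;

    /* The length is sent in network byte order (big-endian). */
    len = ((uint32_t) buf[1] << 24) | ((uint32_t) buf[2] << 16) |
          ((uint32_t) buf[3] << 8) | (uint32_t) buf[4];

    /* The length covers itself (4 bytes) plus the query text and its NUL. */
    if (len < 5 || (size_t) len + 1 > buflen)
        return -1;

    /* The last byte of the message must be the string terminator. */
    if (buf[len] != '\0')
        return -1;

    *query = (const char *) (buf + 5);
    return 0;
}
```

For "begin;" the text plus its terminator is 7 bytes, so the length field is 11 and the whole message is 12 bytes.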

After that, sendDtxExplicitBegin is called and the distributed work begins.

dtmPreCommand("sendDtxExplicitBegin", "(none)", NULL,
                        /* is two-phase */ true, /* withSnapshot */ true, /* inCursor */ false );

dtmPreCommand marks whether the current distributed transaction has to use the two-phase commit protocol.

Execution eventually reaches AllocateWriterGang, which detects that no gang exists yet and creates one:

writerGang = createGang(GANGTYPE_PRIMARY_WRITER, PRIMARY_WRITER_GANG_ID, nsegdb, -1);

A Gang in Greenplum is a set of in-memory resources that work on different segments but are created for the same slice. The following code allocates the gang on the master node:

if (writerGang == NULL)
        {
                int nsegdb = getgpsegmentCount();

                insist_log(IsTransactionOrTransactionBlock(),
                                "cannot allocate segworker group outside of transaction");

                if (GangContext == NULL)
                {
                        GangContext = AllocSetContextCreate(TopMemoryContext,
                                        "Gang Context",
                                        ALLOCSET_DEFAULT_MINSIZE,
                                        ALLOCSET_DEFAULT_INITSIZE,
                                        ALLOCSET_DEFAULT_MAXSIZE);
                }
                Assert(GangContext != NULL);
                oldContext = MemoryContextSwitchTo(GangContext);

                writerGang = createGang(GANGTYPE_PRIMARY_WRITER,
                                PRIMARY_WRITER_GANG_ID, nsegdb, -1);
                writerGang->allocated = true;

                /*
                 * set "whoami" for utility statement.
                 * non-utility statement will overwrite it in function getCdbProcessList.
                 */
                for(i = 0; i < writerGang->size; i++)
                        setQEIdentifier(&writerGang->db_descriptors[i], -1, writerGang->perGangContext);

                MemoryContextSwitchTo(oldContext);
        }

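A GUC called gp_connections_per_thread decides whether the database connections from the master to the segments are established with multiple threads or asynchronously. The default value is 0, which selects the asynchronous path: roughly, connect starts each connection and poll monitors the sockets until every connection is established and its fd has been saved. If the value is not 0, the segments are connected with multiple threads, and the number of threads is computed from the GUC value and the number of segments. These are two implementations of the same goal, and the code is fairly clear, so it is worth reading the source yourself.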
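
The "connect, then poll" pattern of the asynchronous path can be sketched with plain POSIX sockets. This is only an illustration under assumptions: it is not createGang_async and does not use libpq, the toy_* names and TOY_MAX_CONNS are hypothetical, and toy_listen_loopback exists only so the example has something local to connect to. It targets a POSIX system such as Linux.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define TOY_MAX_CONNS 64

/* Demo helper: listen on an ephemeral loopback port and report the port. */
static int toy_listen_loopback(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0 ||
        getsockname(fd, (struct sockaddr *) &addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Start a non-blocking connect to 127.0.0.1:port; returns the fd or -1. */
static int toy_connect_start(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (connect(fd, (struct sockaddr *) &addr, sizeof addr) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }
    return fd;                               /* connected or still in progress */
}

/*
 * Start n connects at once, then poll until each one has finished.
 * Established fds are left in fds[]; failed ones are closed and set to -1.
 * Connects still pending at a timeout stay open for the caller to close.
 * Returns the number of established connections, or -1 if n is too large.
 */
static int toy_connect_all(unsigned short port, int *fds, int n, int timeout_ms)
{
    struct pollfd pfds[TOY_MAX_CONNS];
    int i, pending = 0, established = 0;

    assert(fds != NULL);
    if (n > TOY_MAX_CONNS)
        return -1;
    for (i = 0; i < n; i++) {
        fds[i] = toy_connect_start(port);
        pfds[i].fd = fds[i];                 /* poll() ignores negative fds */
        pfds[i].events = POLLOUT;
        pfds[i].revents = 0;
        if (fds[i] >= 0)
            pending++;
    }
    while (pending > 0) {
        if (poll(pfds, (nfds_t) n, timeout_ms) <= 0)
            break;                           /* timeout or error */
        for (i = 0; i < n; i++) {
            int err = 0;
            socklen_t len = sizeof err;

            if (pfds[i].fd < 0 || !(pfds[i].revents & (POLLOUT | POLLERR | POLLHUP)))
                continue;
            /* SO_ERROR reports the final outcome of the connect. */
            getsockopt(pfds[i].fd, SOL_SOCKET, SO_ERROR, &err, &len);
            if (err == 0) {
                established++;
            } else {
                close(fds[i]);
                fds[i] = -1;
            }
            pfds[i].fd = -1;                 /* stop watching this socket */
            pending--;
        }
    }
    return established;
}
```

All connection attempts are in flight at the same time and a single thread waits for them, which is the same trade-off as the default setting of the GUC: no extra threads, at the cost of an explicit polling loop.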

Because we use the default value, the asynchronous path is taken: createGang_async eventually calls PQconnectStartParams. This is equivalent to a psql client running "psql -d gpadmin" against the postgres process of each segment database.

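Those postgres processes also fork child processes, which set up their environment and then execute the SQL commands that follow. At this point, the master has finished connecting to the segments.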

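The functions that follow send the actual command, which here is BEGIN. cdbdisp_dispatchToGang sends it; because dispatch is asynchronous, cdbdisp_waitDispatchFinish waits for the send to complete, and then cdbdisp_getDispatchResults waits for the segments to return their results.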