Colossal AI Multi-dimensional TP

Original article, last modified on 2024-04-08 10:35:28


This content is licensed under the CC 4.0 BY-SA license.

Tags

#AI #transformer #ColossalAI #linux

First published on 2024-04-08 10:16:31

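This article discusses how the 2D, 2.5D, and 3D TP (tensor parallelism) schemes in Colossal AI are applied to Transformer models, compares their performance under weak scaling and strong scaling, and analyzes how decomposition along different dimensions affects communication and computation. It also covers performance tests on GPU hardware such as the NVIDIA Quadro RTX 5000 and A100.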


1. 2D TP

1.1. SUMMA 2D matrix multiplication

[Figure: SUMMA 2D matrix multiplication]

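Numerical example:

Condition: each matrix can be evenly partitioned into p = q^2 blocks (q blocks along the rows and q blocks along the columns).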
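The block partitioning above can be illustrated with a minimal NumPy sketch that simulates SUMMA serially on a q × q grid. This is only an illustration under stated assumptions: the function name `summa_2d` is hypothetical, the row and column broadcasts are modeled as plain Python loops rather than real collective communication, and it is not Colossal AI's implementation.

```python
import numpy as np


def summa_2d(A, B, q):
    """Serially simulate SUMMA for C = A @ B on a q x q device grid.

    Hypothetical helper for illustration: "device (i, j)" is just an index
    into nested lists, and each "broadcast" is a plain variable read.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    # Condition from the text: every matrix splits evenly into q x q blocks.
    assert m % q == 0 and k % q == 0 and n % q == 0
    mb, kb, nb = m // q, k // q, n // q

    # Device (i, j) owns block (i, j) of A, B, and C.
    A_blk = [[A[i * mb:(i + 1) * mb, j * kb:(j + 1) * kb] for j in range(q)]
             for i in range(q)]
    B_blk = [[B[i * kb:(i + 1) * kb, j * nb:(j + 1) * nb] for j in range(q)]
             for i in range(q)]
    C_blk = [[np.zeros((mb, nb), dtype=np.result_type(A, B)) for _ in range(q)]
             for _ in range(q)]

    for t in range(q):
        for i in range(q):
            a_bcast = A_blk[i][t]      # device (i, t) broadcasts along row i
            for j in range(q):
                b_bcast = B_blk[t][j]  # device (t, j) broadcasts along column j
                C_blk[i][j] += a_bcast @ b_bcast  # local partial product

    # Reassemble the distributed result only to check it against A @ B.
    return np.block(C_blk)


# Small numerical check with p = 4 devices (q = 2) and 4 x 4 matrices.
A = np.arange(16).reshape(4, 4)
B = np.arange(16, 32).reshape(4, 4)
C = summa_2d(A, B, q=2)
print(np.array_equal(C, A @ B))
```

In each of the q steps, a device only needs one block of A from its row and one block of B from its column, which is why every matrix has to split evenly into q × q blocks.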

1.2. Application to Transformers

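Notation: b is the batch size, s is the sequence length (seq_len), h is the hidden size, p is the number of GPUs, and q satisfies p = q^2.
The input has shape {b, s, h}, which is viewed as {bs, h} and partitioned so that each device holds a block of shape {bs/q, h/q}. In practice the split is applied along b and h, as shown in the figure below.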
[Figure: partitioning of the input along b and h]
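The shape bookkeeping above can be checked with a short NumPy sketch. It assumes that b and h are both divisible by q. The helper name `partition_input` is hypothetical, and the blocks are kept in a nested list rather than placed on real devices.

```python
import numpy as np


def partition_input(x, q):
    """Split an input of shape (b, s, h) into a q x q grid of blocks.

    Hypothetical helper for illustration. Block (i, j) has shape
    (b*s/q, h/q); because b is divisible by q, each row chunk covers
    b/q whole samples, so the split is effectively along b and h.
    """
    b, s, h = x.shape
    assert b % q == 0 and h % q == 0, "b and h must be divisible by q"
    x2d = x.reshape(b * s, h)            # {b, s, h} -> {bs, h}
    rows, cols = (b * s) // q, h // q    # per-device block shape
    return [[x2d[i * rows:(i + 1) * rows, j * cols:(j + 1) * cols]
             for j in range(q)]
            for i in range(q)]


# Example: b = 4, s = 3, h = 8 on p = 4 devices (q = 2).
x = np.arange(4 * 3 * 8).reshape(4, 3, 8)
blocks = partition_input(x, q=2)
print(blocks[0][0].shape)  # (6, 4), i.e. (bs/q, h/q)
```

With this layout, block (i, j) is the piece of the flattened input that device (i, j) would hold on the q × q grid.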

Comparison of communication and computation cost (including activation checkpointing)

Performance comparison
1) Weak scaling concerns the speedup for a scaled problem size with respect to the number of processors.

2) Strong scaling concerns the speedup for a fixed problem size with respect to the number of processors.

3) Memory performance
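The two scaling notions in items 1) and 2) are commonly quantified as ratios of run times. The sketch below uses the usual textbook definitions. The function names are hypothetical and the timings in the usage lines are made-up placeholders, not measurements from this article.

```python
def strong_scaling_speedup(t_base, t_p):
    """Speedup for a FIXED problem size: baseline time / time on p processors."""
    return t_base / t_p


def strong_scaling_efficiency(t_base, t_p, p_base, p):
    """Fraction of the ideal speedup p / p_base that is actually achieved."""
    return strong_scaling_speedup(t_base, t_p) / (p / p_base)


def weak_scaling_efficiency(t_base, t_p):
    """Efficiency when the problem size grows in proportion to p.

    Ideally the run time stays constant, so the ratio is 1.0.
    """
    return t_base / t_p


# Placeholder timings in seconds (hypothetical, for illustration only).
print(strong_scaling_speedup(t_base=100.0, t_p=40.0))                  # 2.5
print(strong_scaling_efficiency(t_base=100.0, t_p=40.0, p_base=1, p=4))  # 0.625
print(weak_scaling_efficiency(t_base=100.0, t_p=125.0))                # 0.8
```

An efficiency close to 1.0 means the extra processors are being used well; a value well below 1.0 usually points at communication or load-imbalance overhead.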

Note: test environment
Each node has 4 NVIDIA Quadro RTX 5000 GPUs, and the nodes are interconnected with Mellanox InfiniBand.