[YOLOv11 Improvements: From Theory to Practice, Attention Mechanisms] | 2021, University of Illinois, landskape, WACV: Triplet Attention, with my own interpretation

This content is licensed under CC 4.0 BY-SA.


1 Introduction
- Referenced paper
2 The Triplet Attention module
3 Core code of Triplet Attention
4 Adding Triplet Attention to YOLOv11

1 Introduction

Paper: Rotate to Attend: Convolutional Triplet Attention Module (WACV 2021, University of Illinois, landskape)
Paper link: https://arxiv.org/pdf/2010.03045.pdf
Source code: https://github.com/landskape-ai/triplet-attention

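Referenced paper

"Terahertz imaging recognition method for internal defects in cable composite insulation structures based on improved YOLOv8", 2024, Lanzhou Jiaotong University, Yang Dong, High Voltage Engineering (高电压技术)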

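2 The Triplet Attention module

1 What are the shortcomings of existing work?
2 What problem does the paper solve, and where is its research value?

2-1 Shortcomings of existing work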

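First, missing cross-dimension interaction. Conventional attention mechanisms such as SE and CBAM consider channel attention and spatial attention separately or in combination, but they compute the two independently and do not capture the cross-dimension interaction between the channel dimension (C) and the spatial dimensions (H/W). In CBAM, for example, channel attention and spatial attention are decoupled, so the module cannot model which channels should attend to which spatial regions, and the resulting feature representation is incomplete.
Second, dimensionality reduction loses information. When computing channel attention, SE, CBAM and similar methods rely on a bottleneck (MLP) that reduces the dimension (in SE, the channel count goes from C to C/r and back). This makes the correspondence between channels and their weights indirect, disrupts the non-linear local dependencies between channels, and discards key feature information.
Third, high computational and parameter overhead. Most existing attention modules (SE, CBAM, GC-Net) need extra learnable parameters (the parameter complexity of SE is 2C²/r), which noticeably increases model size and computation. For example, adding SE to ResNet-50 adds 2.514M parameters and GC-Net adds 2.548M, which is unfavorable for deployment in lightweight models.
Fourth, excessive compression of spatial information. Conventional channel attention relies on global average pooling (GAP) or global max pooling (GMP), which compresses the spatial dimensions (H×W) to a single pixel. Much spatial detail is lost, and no precise association between channel features and spatial positions can be established.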
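To make the 2C²/r term concrete, the arithmetic below is a minimal sketch that sums it over a ResNet-50. It rests on assumptions not stated in this post: a reduction ratio r = 16, one SE module per bottleneck block acting on that block's output channels, and no bias terms. The helper name `se_params` and the `stages` table are hypothetical and exist only for this illustration.

```python
# Hypothetical helper: parameters of one SE bottleneck (two fully connected
# layers C -> C/r -> C, biases ignored), i.e. 2 * C^2 / r.
def se_params(channels, reduction=16):
    hidden = channels // reduction
    return channels * hidden + hidden * channels

# Assumed ResNet-50 layout: (output channels of the stage, number of blocks).
stages = [(256, 3), (512, 4), (1024, 6), (2048, 3)]

total = sum(se_params(c) * n for c, n in stages)
print(f"SE overhead under these assumptions: {total:,} parameters")
```

Under these assumptions the sum comes to about 2.51M parameters, in line with the 2.514M overhead quoted above. The quadratic dependence on C is why the deepest stage dominates the cost.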

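2-2 What problem is solved, and where is the research value

2-2-1 Core problems solved

First, a triplet attention mechanism that models cross-dimension interaction. The paper proposes Triplet Attention, which uses three parallel branches to capture the cross-dimension dependencies (channel-height, C×H), (channel-width, C×W) and (height-width, H×W). It brings the interaction between channel attention and spatial attention into a single framework for the first time, removing the limitation of computing the two independently.
Second, no dimensionality reduction and no resulting information loss. The bottleneck-style reduction is dropped. Attention weights are produced directly by tensor permutation, Z-Pool (which combines max and average pooling) and a lightweight convolution, so the direct correspondence between channel and spatial dimensions is kept and dependency information between channels is not lost.
Third, minimal computational and parameter overhead. The module is designed to be almost parameter-free: cross-dimension attention is computed with only 6k² parameters per module (k is the convolution kernel size, usually k=7). On ResNet-50, for example, it adds only 4.8k parameters (0.02%) and 0.047 GFLOPs (1%), far less than SE (2.514M parameters) or CBAM (2.532M parameters).
Fourth, a richer feature representation. The Z-Pool layer aggregates max-pooled and average-pooled features together, avoiding the bias of a single pooling type. Combined with the rotation operation, which rearranges the tensor dimensions, cross-dimension associations are built without collapsing the spatial dimensions, so more spatial detail and its relation to channel features are kept.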
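The 6k² and 4.8k figures above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions rather than the paper's own accounting: each branch is taken to be one 2-to-1 channel k×k convolution followed by a single-channel BatchNorm with affine weight and bias (as in the `BasicConv` used in section 3), and ResNet-50 is assumed to receive one module in each of its 16 bottleneck blocks. The function names are hypothetical.

```python
# Convolution weights only: 3 branches, each a (2 -> 1 channel) k x k kernel.
def triplet_conv_params(k=7, branches=3):
    return branches * 2 * k * k  # = 6 * k^2 for three branches

# Convolution weights plus BatchNorm weight and bias (2 values per branch).
def triplet_learnable_params(k=7, branches=3):
    return branches * (2 * k * k + 2)

blocks = 16  # assumed: 3 + 4 + 6 + 3 bottleneck blocks in ResNet-50
per_module = triplet_learnable_params()
total = blocks * per_module

print(triplet_conv_params())  # convolution weights per module
print(per_module, total)      # learnable parameters per module and in total
```

With k=7 the convolutions account for 294 weights per module. Counting the BatchNorm parameters as well gives 300 per module and 4,800 over 16 blocks, which agrees with the 4.8k quoted above. None of these counts depend on the channel width C, unlike the SE bottleneck.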

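2-2-2 Research value

1 Theoretical value
The work shows that cross-dimension interaction plays a key role in the performance of attention mechanisms. It fills the gap left by conventional methods that separate channel and spatial attention, and offers a new core idea for designing attention modules.
It proposes a design pattern of "no dimensionality reduction plus lightweight cross-dimension aggregation", which challenges the common assumption that efficiency requires dimensionality reduction and shows that effective feature enhancement is possible at low cost.

2 Practical value
Broad performance gains: effectiveness is verified on several computer vision tasks. On ImageNet-1k classification, ResNet-50 with the module reduces Top-1 error by 2.04%; on MS COCO object detection, the AP of Mask R-CNN improves by 2.5 points; on PASCAL VOC, AP exceeds CBAM by 2.6 points.
Strong compatibility: the module can be inserted into both heavyweight and lightweight backbones such as ResNet and MobileNetV2 without major architectural changes, which lowers the barrier to industrial use.
Better interpretability: Grad-CAM/Grad-CAM++ visualizations show that the module localizes target regions more precisely (for example, tighter feature binding for classes such as "husky" and "fighter jet"), which makes model decisions easier to interpret.
Friendly to lightweight deployment: the very small parameter and computation overhead suits resource-constrained settings such as mobile and edge devices, offering an efficient way to improve lightweight models.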

2-3 Basic principle of Triplet Attention

2-4 Triplet Attention compared with other simple attention mechanisms

2-5 Implementation flow of Triplet Attention

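1 Top branch: computes attention weights across the channel dimension C and the spatial dimension W. The input tensor is first rotated (permuted) so that H takes the place of the channel axis, then passed through Z-Pool and a convolution layer (Conv), and a Sigmoid produces the attention weights.

2 Middle branch: captures the dependency between the channel dimension C and the spatial dimension H. The input is rotated so that W takes the place of the channel axis, followed by the same Z-Pool, convolution and Sigmoid steps.

3 Bottom branch: captures the dependency between the spatial dimensions H and W. This branch keeps the input as it is (Identity, no rotation), applies Z-Pool and the convolution, and also produces attention weights with a Sigmoid.

In each branch the attention weights are multiplied with that branch's (rotated) input. The two rotated branches are then permuted back to the original layout, and the outputs of the three branches are averaged (Avg) to give the final triplet attention output.

Through these rotations and permutations the structure combines information across different dimensions and captures the intrinsic features of the data better. It is also computationally cheap and can be added as a module to existing network architectures to strengthen their handling of complex data.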
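The flow of one rotated branch can be traced shape by shape. The sketch below is an illustration, not the authors' code: it follows the channel-width branch with a toy tensor, the function name `z_pool` and the tensor sizes are chosen only for this example, and BatchNorm is left out for brevity.

```python
import torch

def z_pool(x):
    # Concatenate max and mean over dim 1, reducing that axis to size 2.
    return torch.cat((x.max(dim=1, keepdim=True)[0],
                      x.mean(dim=1, keepdim=True)), dim=1)

x = torch.randn(2, 8, 5, 6)                # (B, C, H, W)

# Rotate: swap C and H so that pooling runs over H.
x_rot = x.permute(0, 2, 1, 3)              # (B, H, C, W)
pooled = z_pool(x_rot)                     # (B, 2, C, W)

# A 7x7 convolution over the (C, W) plane gives one weight map.
conv = torch.nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
weights = torch.sigmoid(conv(pooled))      # (B, 1, C, W), values in (0, 1)

# Apply the weights, then rotate back to the original layout.
out = (x_rot * weights).permute(0, 2, 1, 3)  # (B, C, H, W)

print(pooled.shape, weights.shape, out.shape)
```

Swapping two axes is its own inverse, so the same permutation restores the layout. The weight map is broadcast along H, which means every (channel, width) position gets its own scaling factor, and this is the cross-dimension interaction that a pure channel or pure spatial attention cannot express.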

3 Core code of Triplet Attention

import torch
import torch.nn as nn
 
__all__ = ['C2PSA_TripleAttention', 'TripletAttention']
 
class BasicConv(nn.Module):
    def __init__(self, in_planes, out_planes, kernel_size, stride=1, padding=0, dilation=1, groups=1, relu=True,
                 bn=True, bias=False):
        super(BasicConv, self).__init__()
        self.out_channels = out_planes
        self.conv = nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=padding,
                              dilation=dilation, groups=groups, bias=bias)
        self.bn = nn.BatchNorm2d(out_planes, eps=1e-5, momentum=0.01, affine=True) if bn else None
        self.relu = nn.ReLU() if relu else None
 
    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        if self.relu is not None:
            x = self.relu(x)
        return x
 
 
class ZPool(nn.Module):
    def forward(self, x):
        return torch.cat((torch.max(x, 1)[0].unsqueeze(1), torch.mean(x, 1).unsqueeze(1)), dim=1)
 
 
class AttentionGate(nn.Module):
    def __init__(self):
        super(AttentionGate, self).__init__()
        kernel_size = 7
        self.compress = ZPool()
        self.conv = BasicConv(2, 1, kernel_size, stride=1, padding=(kernel_size - 1) // 2, relu=False)
 
    def forward(self, x):
        x_compress = self.compress(x)
        x_out = self.conv(x_compress)
        scale = torch.sigmoid_(x_out)
        return x * scale
 
 
class TripletAttention(nn.Module):
    def __init__(self, no_spatial=False):
        super(TripletAttention, self).__init__()
        self.cw = AttentionGate()
        self.hc = AttentionGate()
        self.no_spatial = no_spatial
        if not no_spatial:
            self.hw = AttentionGate()
 
    def forward(self, x):
        x_perm1 = x.permute(0, 2, 1, 3).contiguous()
        x_out1 = self.cw(x_perm1)
        x_out11 = x_out1.permute(0, 2, 1, 3).contiguous()
        x_perm2 = x.permute(0, 3, 2, 1).contiguous()
        x_out2 = self.hc(x_perm2)
        x_out21 = x_out2.permute(0, 3, 2, 1).contiguous()
        if not self.no_spatial:
            x_out = self.hw(x)
            x_out = 1 / 3 * (x_out + x_out11 + x_out21)
        else:
            x_out = 1 / 2 * (x_out11 + x_out21)
        return x_out
 
def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p
 
 
class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
    default_act = nn.SiLU()  # default activation
 
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
 
    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))
 
    def forward_fuse(self, x):
        """Apply convolution and activation without batch normalization (used after Conv+BN fusion)."""
        return self.act(self.conv(x))
 
 
class PSABlock(nn.Module):
    """
    PSABlock variant that uses TripletAttention in place of multi-head attention.
    The block applies TripletAttention followed by a feed-forward network of two 1x1 convolutions,
    each with an optional shortcut connection.
    Attributes:
        attn (TripletAttention): Triplet attention module; its parameters do not depend on the channel count.
        ffn (nn.Sequential): Feed-forward neural network module.
        add (bool): Flag indicating whether to add shortcut connections.
    Methods:
        forward: Performs a forward pass through the PSABlock, applying attention and feed-forward layers.
    Notes:
        attn_ratio and num_heads are kept only for signature compatibility and are not used here.
    Examples:
        Create a PSABlock and perform a forward pass
        >>> psablock = PSABlock(c=128, attn_ratio=0.5, num_heads=4, shortcut=True)
        >>> input_tensor = torch.randn(1, 128, 32, 32)
        >>> output_tensor = psablock(input_tensor)
    """
 
    def __init__(self, c, attn_ratio=0.5, num_heads=4, shortcut=True) -> None:
        """Initializes the PSABlock with attention and feed-forward layers for enhanced feature extraction."""
        super().__init__()
 
        self.attn = TripletAttention()
        self.ffn = nn.Sequential(Conv(c, c * 2, 1), Conv(c * 2, c, 1, act=False))
        self.add = shortcut
 
    def forward(self, x):
        """Executes a forward pass through PSABlock, applying attention and feed-forward layers to the input tensor."""
        x = x + self.attn(x) if self.add else self.attn(x)
        x = x + self.ffn(x) if self.add else self.ffn(x)
        return x
 
 
class C2PSA_TripleAttention(nn.Module):
    """
    C2PSA module whose PSABlock modules use TripletAttention for feature enhancement.
    The input is projected to 2*c channels and split into two halves. One half passes through a series of
    PSABlock modules (triplet attention plus feed-forward), then both halves are concatenated and projected back.
    Attributes:
        c (int): Number of hidden channels.
        cv1 (Conv): 1x1 convolution layer that projects the c1 input channels to 2*c.
        cv2 (Conv): 1x1 convolution layer that projects the concatenated 2*c channels back to c1.
        m (nn.Sequential): Sequential container of PSABlock modules for attention and feed-forward operations.
    Methods:
        forward: Performs a forward pass through the module, applying attention and feed-forward operations.
    Notes:
        The structure follows the C2PSA module; only the attention inside PSABlock is replaced.
    Examples:
        >>> c2psa = C2PSA_TripleAttention(c1=256, c2=256, n=3, e=0.5)
        >>> input_tensor = torch.randn(1, 256, 64, 64)
        >>> output_tensor = c2psa(input_tensor)
    """
 
    def __init__(self, c1, c2, n=1, e=0.5):
        """Initializes the C2PSA module with specified input/output channels, number of layers, and expansion ratio."""
        super().__init__()
        assert c1 == c2
        self.c = int(c1 * e)
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv(2 * self.c, c1, 1)
 
        self.m = nn.Sequential(*(PSABlock(self.c, attn_ratio=0.5, num_heads=self.c // 64) for _ in range(n)))
 
    def forward(self, x):
        """Processes the input tensor 'x' through a series of PSA blocks and returns the transformed tensor."""
        a, b = self.cv1(x).split((self.c, self.c), dim=1)
        b = self.m(b)
        return self.cv2(torch.cat((a, b), 1))
 
if __name__ == "__main__":
    # Generating Sample image
    image_size = (1, 64, 224, 224)
    image = torch.rand(*image_size)
 
    # Model
    model = C2PSA_TripleAttention(64, 64)
 
    out = model(image)
    print(out.size())