翻译：经典再读——BEVFormer中英文对照-CSDN博客

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
BEVFormer：通过时空变换器从多机位图像学习鸟瞰图表示

Zhiqi Li1,2∗, Wenhai Wang2∗, Hongyang Li2∗, Enze Xie3, Chonghao Sima2,
李志琦 1,2∗，王文海 2，∗，李鸿阳 2∗，谢恩泽 3，司马崇浩 2，
Tong Lu1, Yu Qiao2, Jifeng Dai2
陆桐 1，乔余 2，戴吉丰 2
1Nanjing University 2Shanghai AI Laboratory 3The University of Hong Kong
1 南京大学 2上海人工智能实验室 3 香港大学

目录

Abstract

摘要

1 Introduction

1 简介

2 Related Work

2 相关研究

2.1 Transformer-based 2D perception

2.1 基于 Transformer 的 2D 感知技术

2.2 Camera-based 3D Perception

2.2 基于摄像头的 3D 感知技术

3 BEVFormer

3.1 Overall Architecture

3.1 整体架构

3.2 BEV Queries

3.2 BEV 查询

3.3 Spatial Cross-Attention

3.3 空间交叉注意力

3.4 Temporal Self-Attention

3.4 时序自注意力

3.5 Applications of BEV Features

3.5 BEV特征的应用

3.6 Implementation Details

3.6 实施细节

4 Experiments

4.1 Datasets

4.1 数据集

4.2 Experimental Settings

4.2 实验设置

4.3 3D Object Detection Results

4.3 3D 物体检测结果

4.4 Multi-tasks Perception Results

4.4 多任务感知测试结果

4.5 Ablation Study

4.5 消融研究

4.6 Visualization Results

4.6 可视化结果

5 Discussion and Conclusion

5 讨论与结论

================================================================

Abstract

摘要

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-of-the-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
三维视觉感知任务，包括基于多摄像头图像的三维检测和地图分割，是自动驾驶系统的关键。在本研究中，我们提出了一个名为 BEVFormer 的新框架，它通过时空 Transformer 学习统一的 BEV 表示，以支持多项自动驾驶感知任务。简而言之，BEVFormer 通过预定义的网格状 BEV 查询与空间和时间空间交互，从而同时利用空间和时间信息。为了聚合空间信息，我们设计了空间交叉注意力，使每个 BEV 查询都能从各摄像头视角的感兴趣区域中提取空间特征。对于时间信息，我们提出时间自注意力，以循环方式融合历史 BEV 信息。我们的方法在 nuScenes 测试集上取得了 56.9% 的 NDS，刷新了最优结果，比此前最佳方法高出 9.0 个点，与基于激光雷达的基线性能相当。我们进一步证明，BEVFormer 显著提高了速度估计的准确性以及低能见度条件下物体的召回率。代码可在 https://github.com/zhiqi-li/BEVFormer 获取。


Figure 1: We propose BEVFormer, a paradigm for autonomous driving that applies both Transformer and Temporal structure to generate bird’s-eye-view (BEV) features from multi-camera inputs. BEVFormer leverages queries to lookup spatial/temporal space and aggregate spatiotemporal information correspondingly, hence benefiting stronger representations for perception tasks.
图 1：我们提出了 BEVFormer，这是一种自动驾驶范式，结合Transformer和时间结构，从多摄像头视角输入生成鸟瞰图（BEV）特征。BEVFormer 利用查询查找时空空间并相应聚合时空信息，从而为感知任务提供了更强的表征。

1 Introduction

1 简介

Perception in 3D space is critical for various applications such as autonomous driving, robotics, etc. Despite the remarkable progress of LiDAR-based methods [43, 20, 54, 50, 8], camera-based approaches [45, 32, 47, 30] have attracted extensive attention in recent years. Apart from the low cost for deployment, cameras own the desirable advantages to detect long-range distance objects and identify vision-based road elements (e.g., traffic lights, stoplines), compared to LiDAR-based counterparts.
三维空间中的感知对于自动驾驶、机器人技术等多种应用至关重要。尽管基于激光雷达的方法取得了显著进展 [43, 20, 54, 50, 8]，但基于摄像头的方法 [45, 32, 47, 30] 近年来引起了广泛关注。除了部署成本低之外，与基于激光雷达的方案相比，摄像头还具备探测远距离物体和识别基于视觉的道路元素（如红绿灯、停止线）的优势。

Visual perception of the surrounding scene in autonomous driving is expected to predict the 3D bounding boxes or the semantic maps from 2D cues given by multiple cameras. The most straightforward solution is based on the monocular frameworks [45, 44, 31, 35, 3] and cross-camera post-processing. The downside of this framework is that it processes different views separately and cannot capture information across cameras, leading to low performance and efficiency [32, 47].
自动驾驶中对周围场景的视觉感知，需要根据多个摄像头提供的二维线索预测三维边界框或语义地图。最直接的解决方案基于单目框架 [45, 44, 31, 35, 3] 和跨摄像头后处理。该框架的缺点是它会分别处理不同的视图，无法跨摄像头捕捉信息，导致性能和效率较低 [32, 47]。

As an alternative to the monocular frameworks, a more unified framework is extracting holistic representations from multi-camera images. The bird’s-eye-view (BEV) is a commonly used representation of the surrounding scene since it clearly presents the location and scale of objects and is suitable for various autonomous driving tasks, such as perception and planning [29]. Although previous map segmentation methods demonstrate BEV’s effectiveness [32, 18, 29], BEV-based approaches have not shown significant advantages over other paradigm in 3D object detections [47, 31, 34]. The underlying reason is that the 3D object detection task requires strong BEV features to support accurate 3D bounding box prediction, but generating BEV from the 2D planes is ill-posed. A popular BEV framework that generates BEV features is based on depth information [46, 32, 34], but this paradigm is sensitive to the accuracy of depth values or the depth distributions. The detection performance of BEV-based methods is thus subject to compounding errors [47], and inaccurate BEV features can seriously hurt the final performance. Therefore, we are motivated to design a BEV generating method that does not rely on depth information and can learn BEV features adaptively rather than strictly rely on 3D prior. Transformer, which uses an attention mechanism to aggregate valuable features dynamically, meets our demands conceptually.
作为单目框架的替代方案，更统一的框架是从多摄像头图像中提取整体表征。鸟瞰图（BEV）是常用的周围场景表示方式，因为它清晰地展示了物体的位置和尺度，适用于各种自动驾驶任务，如感知和规划 [29]。尽管以往的地图分割方法证明了 BEV 的有效性 [32, 18, 29]，基于 BEV 的方法在三维物体检测方面并未表现出相较其他范式的显著优势 [47, 31, 34]。其根本原因是三维物体检测任务需要强大的 BEV 特征来支持准确的三维边界框预测，但从二维平面生成 BEV 是一个不适定（ill-posed）问题。一种流行的 BEV 特征生成框架基于深度信息 [46, 32, 34]，但该范式对深度值或深度分布的准确性非常敏感。基于 BEV 的方法的检测性能因此会受到误差累积的影响 [47]，不准确的 BEV 特征会严重损害最终性能。因此，我们希望设计一种不依赖深度信息、能够自适应学习 BEV 特征而非严格依赖三维先验的 BEV 生成方法。Transformer 利用注意力机制动态聚合有价值的特征，在概念上满足了我们的需求。

Another motivation for using BEV features to perform perception tasks is that BEV is a desirable bridge to connect temporal and spatial space. For the human visual perception system, temporal information plays a crucial role in inferring the motion state of objects and identifying occluded objects, and many works in vision fields have demonstrated the effectiveness of using video data [2, 27, 26, 33, 19]. However, the existing state-of-the-art multi-camera 3D detection methods rarely exploit temporal information. The significant challenges are that autonomous driving is time-critical and objects in the scene change rapidly, and thus simply stacking BEV features of cross timestamps brings extra computational cost and interference information, which might not be ideal. Inspired by recurrent neural networks (RNNs) [17, 10], we utilize the BEV features to deliver temporal information from past to present recurrently, which has the same spirit as the hidden states of RNN models.
使用 BEV 特征来执行感知任务的另一个原因是，BEV 是一种理想的桥梁，能够连接时间和空间信息。在人类视觉感知系统中，时间信息对于推断物体的运动状态和识别被遮挡的物体至关重要。视觉领域的许多研究都证明了利用视频数据的有效性 [2, 27, 26, 33, 19]。然而，现有的最先进多摄像头 3D 检测方法很少利用时间信息。面临的重大挑战在于：自动驾驶对时间响应要求极高，且场景中的物体变化迅速。因此，简单地将不同时间戳下的 BEV 特征叠加在一起，会带来额外的计算成本和干扰信息，这种做法可能并不理想。受循环神经网络（RNNs）的启发 [17, 10]，我们利用 BEV 特征循环地传递从过去到现在的信息，这一机制与 RNN 模型的隐藏状态类似。

To this end, we present a transformer-based bird’s-eye-view (BEV) encoder, termed BEVFormer, which can effectively aggregate spatiotemporal features from multi-view cameras and history BEV features. The BEV features generated from the BEVFormer can simultaneously support multiple 3D perception tasks such as 3D object detection and map segmentation, which is valuable for the autonomous driving system. As shown in Fig. 1, our BEVFormer contains three key designs, which are (1) grid-shaped BEV queries to fuse spatial and temporal features via attention mechanisms flexibly, (2) spatial cross-attention module to aggregate the spatial features from multi-camera images, and (3) temporal self-attention module to extract temporal information from history BEV features, which benefits the velocity estimation of moving objects and the detection of heavily occluded objects, while bringing negligible computational overhead. With the unified features generated by BEVFormer, the model can collaborate with different task-specific heads such as Deformable DETR [56] and mask decoder [22], for end-to-end 3D object detection and map segmentation.
为此，我们提出了一种基于 Transformer 的鸟瞰图（BEV）编码器，命名为 BEVFormer。该编码器能够有效整合多视图摄像机获取的时空特征以及历史鸟瞰图特征。通过 BEVFormer 生成的鸟瞰图特征可同时支持多种 3D 感知任务，如 3D 物体检测和地图分割，这对自动驾驶系统而言具有重要意义。如图 1 所示，BEVFormer 包含三个关键设计要素：（1）网格状的鸟瞰图查询机制，可通过注意力机制灵活地融合空间和时间特征；（2）空间交叉注意力模块，用于整合多摄像机图像中的空间特征；（3）时间自注意力模块，用于从历史鸟瞰图特征中提取时间信息。这些设计有助于移动物体的速度估计和被严重遮挡物体的检测，且几乎不会增加计算成本。借助 BEVFormer 生成的统一特征，该模型可与 Deformable DETR [56]、掩码解码器 [22] 等任务专用模块协同工作，实现端到端的 3D 物体检测和地图分割。

Our main contributions are as follows:
我们的主要贡献如下：

∙ We propose BEVFormer, a spatiotemporal transformer encoder that projects multi-camera and/or timestamp input to BEV representations. With the unified BEV features, our model can simultaneously support multiple autonomous driving perception tasks, including 3D detection and map segmentation.
∙ 我们提出了 BEVFormer，这是一种时空变换编码器，能够将多摄像头采集的图像和/或时间戳信息转换为 BEV 表示形式。借助统一的 BEV 特征，我们的模型能够同时处理多种自动驾驶感知任务，包括 3D 检测和地图分割。

∙ We designed learnable BEV queries along with a spatial cross-attention layer and a temporal self-attention layer to lookup spatial features from cross cameras and temporal features from history BEV, respectively, and then aggregate them into unified BEV features.
∙ 我们设计了可学习的 BEV 查询机制，结合空间交叉注意力层和时间自注意力层，分别从多摄像头获取空间特征，从历史 BEV 数据中获取时间特征，再将这些特征整合为统一的 BEV 特征。

∙ We evaluate the proposed BEVFormer on multiple challenging benchmarks, including nuScenes [4] and Waymo [40]. Our BEVFormer consistently achieves improved performance compared to the prior arts. For example, under a comparable parameters and computation overhead, BEVFormer achieves 56.9% NDS on nuScenes test set, outperforming previous best detection method DETR3D [47] by 9.0 points (56.9% vs. 47.9%). For the map segmentation task, we also achieve the state-of-the-art performance, more than 5.0 points higher than Lift-Splat [32] on the most challenging lane segmentation. We hope this straightforward and strong framework can serve as a new baseline for following 3D perception tasks.
∙ 我们在多个具有挑战性的基准测试上对所提出的 BEVFormer 进行了评估，包括 nuScenes [4] 和 Waymo [40]。与现有方法相比，BEVFormer 始终表现出更优的性能。例如，在参数和计算开销相当的情况下，BEVFormer 在 nuScenes 测试集上的 NDS 得分达到了 56.9%，比之前的最佳检测方法 DETR3D [47] 高出 9.0 个百分点（56.9% 对比 47.9%）。在地图分割任务中，我们也取得了最先进的性能：在最具有挑战性的车道分割任务上，比 Lift-Splat [32] 高出 5.0 个百分点以上。我们希望这一简单而强大的框架能够成为后续 3D 感知任务的新基准。

2 Related Work

2 相关研究

2.1 Transformer-based 2D perception

2.1 基于 Transformer 的 2D 感知技术

Recently, a new trend is to use transformer to reformulate detection and segmentation tasks [7, 56, 22].
最近，一种新的趋势是使用 Transformer 来重新构建检测和分割任务 [7, 56, 22]。

DETR [7] uses a set of object queries to generate detection results by the cross-attention decoder directly. However, the main drawback of DETR is the long training time. Deformable DETR [56] solves this problem by proposing deformable attention. Different from vanilla global attention in DETR, the deformable attention interacts with local regions of interest, which only samples K points near each reference point and calculates attention results, resulting in high efficiency and significantly shortening the training time. The deformable attention mechanism is calculated by:
DETR [7] 通过交叉注意力解码器直接利用一系列对象查询来生成检测结果。不过，DETR 的主要缺点是训练时间过长。Deformable DETR [56] 通过引入可变形注意力机制解决了这一问题。与 DETR 中的普通全局注意力不同，可变形注意力会与局部感兴趣区域进行交互：它仅在每个参考点附近采样 K 个点来计算注意力值，从而显著提升了效率并缩短了训练时间。可变形注意力机制的计算公式如下：

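$$\mathrm{DeformAttn}(q, p, x) = \sum_{i=1}^{N_{\text{head}}} \mathcal{W}_i \sum_{j=1}^{N_{\text{key}}} \mathcal{A}_{ij} \cdot \mathcal{W}'_i \, x(p + \Delta p_{ij}), \tag{1}$$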

where $q$, $p$, $x$ represent the query, reference point and input features, respectively. $i$ indexes the attention head, and $N_{\text{head}}$ denotes the total number of attention heads. $j$ indexes the sampled keys, and $N_{\text{key}}$ is the total sampled key number for each head. $\mathcal{W}_i \in \mathbb{R}^{C \times (C/H_{\text{head}})}$ and $\mathcal{W}'_i \in \mathbb{R}^{(C/H_{\text{head}}) \times C}$ are the learnable weights, where $C$ is the feature dimension. $\mathcal{A}_{ij} \in [0, 1]$ is the predicted attention weight, and is normalized by $\sum_{j=1}^{N_{\text{key}}} \mathcal{A}_{ij} = 1$. $\Delta p_{ij} \in \mathbb{R}^2$ are the predicted offsets to the reference point $p$. $x(p + \Delta p_{ij})$ represents the feature at location $p + \Delta p_{ij}$, which is extracted by bilinear interpolation as in Dai et al. [12]. In this work, we extend the deformable attention to 3D perception tasks, to efficiently aggregate both spatial and temporal information.
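其中，$q$、$p$、$x$ 分别代表查询、参考点和输入特征。$i$ 用于索引注意力头，$N_{\text{head}}$ 表示注意力头的总数。$j$ 用于索引被采样的键，$N_{\text{key}}$ 是每个头采样的键的总数。$\mathcal{W}_i \in \mathbb{R}^{C \times (C/H_{\text{head}})}$ 和 $\mathcal{W}'_i \in \mathbb{R}^{(C/H_{\text{head}}) \times C}$ 是可学习的权重，其中 $C$ 表示特征维度。$\mathcal{A}_{ij} \in [0, 1]$ 是预测得到的注意力权重，并按 $\sum_{j=1}^{N_{\text{key}}} \mathcal{A}_{ij} = 1$ 进行归一化。$\Delta p_{ij} \in \mathbb{R}^2$ 是相对于参考点 $p$ 的预测偏移量。$x(p + \Delta p_{ij})$ 表示位于 $p + \Delta p_{ij}$ 位置的特征，该特征通过双线性插值提取，具体方法如 Dai 等人 [12] 所述。在本研究中，我们将这种可变形注意力机制扩展到 3D 感知任务中，以便高效地聚合空间和时间信息。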
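As a rough illustration of Eq. (1), the sketch below implements only the sampling-and-aggregation core in plain Python. It makes several simplifying assumptions that are not in the paper: the learned projections $\mathcal{W}_i$ and $\mathcal{W}'_i$ are treated as identity, and the offsets $\Delta p_{ij}$ and weights $\mathcal{A}_{ij}$ are passed in directly instead of being predicted from the query $q$. The function names `bilinear_sample` and `deform_attn` are hypothetical.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate feat[row][col] (a list of channel vectors) at (x, y).

    x indexes columns and y indexes rows; neighbours outside the map contribute zeros.
    """
    height, width, channels = len(feat), len(feat[0]), len(feat[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * channels
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < height and 0 <= xx < width:
                for c in range(channels):
                    out[c] += wy * wx * feat[yy][xx][c]
    return out


def deform_attn(p, feat, offsets, weights):
    """Sampling core of Eq. (1), with W_i and W'_i assumed to be identity.

    offsets[i][j] is the offset (dx, dy) for head i and key j;
    weights[i][j] is A_ij, and each weights[i] is expected to sum to 1.
    """
    channels = len(feat[0][0])
    out = [0.0] * channels
    for head_offsets, head_weights in zip(offsets, weights):
        for (dx, dy), a in zip(head_offsets, head_weights):
            sampled = bilinear_sample(feat, p[0] + dx, p[1] + dy)
            for c in range(channels):
                out[c] += a * sampled[c]
    return out


# Toy 2x2 feature map with one channel, one head, two sampled keys.
toy_feat = [[[0.0], [1.0]], [[2.0], [3.0]]]
toy_out = deform_attn((0, 0), toy_feat, [[(0, 0), (1, 0)]], [[0.5, 0.5]])
```

The point of the design is visible even in this toy form: the cost depends on the number of sampled keys per head, not on the size of the feature map.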

2.2 Camera-based 3D Perception

2.2 基于摄像头的 3D 感知技术

Previous 3D perception methods typically perform 3D object detection or map segmentation tasks independently. For the 3D object detection task, early methods are similar to 2D detection methods [1, 28, 49, 39, 53], which usually predict the 3D bounding boxes based on 2D bounding boxes. Wang et al. [45] follow an advanced 2D detector FCOS [41] and directly predict 3D bounding boxes for each object. DETR3D [47] projects learnable 3D queries in 2D images, and then samples the corresponding features for end-to-end 3D bounding box prediction without NMS post-processing. Another solution is to transform image features into BEV features and predict 3D bounding boxes from the top-down view. Methods transform image features into BEV features with the depth information from depth estimation [46] or categorical depth distribution [34]. OFT [36] and ImVoxelNet [37] project the predefined voxels onto image features to generate the voxel representation of the scene. Recently, M2BEV [48] further explored the feasibility of simultaneously performing multiple perception tasks based on BEV features.
以往的 3D 感知方法通常分别独立地执行 3D 物体检测或地图分割任务。在 3D 物体检测方面，早期方法与 2D 检测方法类似 [1, 28, 49, 39, 53]，这些方法通常基于 2D 边界框来预测 3D 边界框。Wang 等人 [45] 采用了先进的 2D 检测器 FCOS [41]，直接为每个物体预测 3D 边界框。DETR3D [47] 将可学习的 3D 查询投影到 2D 图像中，然后采样相应的特征，从而实现端到端的 3D 边界框预测，无需进行 NMS 后处理。另一种解决方案是将图像特征转换为 BEV 特征，然后从俯视角度预测 3D 边界框。这些方法通过深度估计的深度信息 [46] 或分类式深度分布 [34]，将图像特征转换为 BEV 特征。OFT [36] 和 ImVoxelNet [37] 将预定义的体素投影到图像特征上，从而生成场景的体素表示。最近，M2BEV [48] 进一步探讨了基于 BEV 特征同时执行多项感知任务的可行性。

Actually, generating BEV features from multi-camera features is more extensively studied in map segmentation tasks [32, 30]. A straightforward method is converting perspective view into the BEV through Inverse Perspective Mapping (IPM) [35, 5]. In addition, Lift-Splat [32] generates the BEV features based on the depth distribution. Methods [30, 16, 9] utilize multilayer perceptron to learn the translation from perspective view to the BEV. PYVA [51] proposes a cross-view transformer that converts the front-view monocular image into the BEV, but this paradigm is not suitable for fusing multi-camera features due to the computational cost of the global attention mechanism [42]. In addition to the spatial information, previous works [18, 38, 6] also consider the temporal information by stacking BEV features from several timestamps. Stacking BEV features constrains the available temporal information within a fixed time duration and brings extra computational cost. In this work, the proposed spatiotemporal transformer generates BEV features of the current time by considering both spatial and temporal clues, and the temporal information is obtained from the previous BEV features in an RNN manner, which only brings little computational cost.
实际上，在地图分割任务中，利用多摄像头特征生成 BEV 特征的研究更为广泛 [32, 30]。一种直接的方法是通过逆透视映射（IPM）将透视视图转换为 BEV [35, 5]。此外，Lift-Splat 方法 [32] 则基于深度分布来生成 BEV 特征。还有一些方法 [30, 16, 9] 利用多层感知器来学习从透视视图到 BEV 的转换。PYVA 方法 [51] 提出了一种跨视图变换器，可将前视单目图像转换为 BEV，但由于全局注意力机制 [42] 的计算成本较高，该方案并不适合多摄像头特征的融合。除了空间信息外，先前的研究 [18, 38, 6] 还通过叠加多个时间戳的 BEV 特征来考虑时间信息。叠加 BEV 特征会将可用的时间信息限制在固定的时间范围内，同时会增加额外的计算成本。在本研究中，所提出的时空变换器通过同时考虑空间和时间线索来生成当前时间的 BEV 特征；时间信息则是通过 RNN 方式从之前的 BEV 特征中获取的，这种方式只带来很小的计算成本。

3 BEVFormer

Converting multi-camera image features to bird’s-eye-view (BEV) features can provide a unified surrounding environment representation for various autonomous driving perception tasks. In this work, we present a new transformer-based framework for BEV generation, which can effectively aggregate spatiotemporal features from multi-view cameras and history BEV features via attention mechanisms.
将多摄像头采集的图像特征转换为鸟瞰图特征，可以为各种自动驾驶感知任务提供统一的周围环境表示方式。在本研究中，我们提出了一种基于 Transformer 的鸟瞰图生成框架，该框架能够通过注意力机制有效整合多视角摄像头采集的时空特征以及先前的鸟瞰图特征。


Figure 2: Overall architecture of BEVFormer. (a) The encoder layer of BEVFormer contains grid-shaped BEV queries, temporal self-attention, and spatial cross-attention. (b) In spatial cross-attention, each BEV query only interacts with image features in the regions of interest. (c) In temporal self-attention, each BEV query interacts with two features: the BEV queries at the current timestamp and the BEV features at the previous timestamp.
图 2：BEVFormer 的整体架构。（a）BEVFormer 的编码器层包含网格状的 BEV 查询、时间自注意力以及空间交叉注意力。（b）在空间交叉注意力中，每个 BEV 查询仅与感兴趣区域内的图像特征进行交互。（c）在时间自注意力中，每个 BEV 查询与两种特征进行交互：当前时间戳下的 BEV 查询，以及上一个时间戳下的 BEV 特征。

3.1 Overall Architecture

3.1 整体架构

As illustrated in Fig. 2, BEVFormer has 6 encoder layers, each of which follows the conventional structure of transformers [42], except for three tailored designs, namely BEV queries, spatial cross-attention, and temporal self-attention. Specifically, BEV queries are grid-shaped learnable parameters, which is designed to query features in BEV space from multi-camera views via attention mechanisms. Spatial cross-attention and temporal self-attention are attention layers working with BEV queries, which are used to lookup and aggregate spatial features from multi-camera images as well as temporal features from history BEV, according to the BEV query.
如图 2 所示，BEVFormer 包含 6 个编码器层，每个层都遵循transformers 的常规结构 [42]，不过其中有三个是经过特殊设计的模块：BEV 查询、空间交叉注意力与时序自注意力。具体而言，BEV 查询是一种网格状的可学习参数，其作用是通过注意力机制从多摄像头视角中提取 BEV 空间中的特征。空间交叉注意力与时序自注意力则是与 BEV 查询协同工作的注意力层，它们根据 BEV 查询的结果，从多摄像头图像中提取空间特征，同时从历史 BEV 数据中提取时序特征并进行聚合。

During inference, at timestamp $t$, we feed multi-camera images to the backbone network (e.g., ResNet-101 [15]), and obtain the features $F_t = \{F_t^i\}_{i=1}^{N_{\text{view}}}$ of different camera views, where $F_t^i$ is the feature of the $i$-th view and $N_{\text{view}}$ is the total number of camera views. At the same time, we preserve the BEV features $B_{t-1}$ at the prior timestamp $t-1$. In each encoder layer, we first use BEV queries $Q$ to query the temporal information from the prior BEV features $B_{t-1}$ via the temporal self-attention. We then employ BEV queries $Q$ to inquire about the spatial information from the multi-camera features $F_t$ via the spatial cross-attention. After the feed-forward network [42], the encoder layer outputs the refined BEV features, which are the input of the next encoder layer. After 6 stacked encoder layers, unified BEV features $B_t$ at current timestamp $t$ are generated. Taking the BEV features $B_t$ as input, the 3D detection head and map segmentation head predict the perception results such as 3D bounding boxes and semantic map.
在推理过程中，在时间戳 $t$，我们将多摄像头图像输入主干网络（例如 ResNet-101 [15]），得到不同摄像头视角的特征 $F_t = \{F_t^i\}_{i=1}^{N_{\text{view}}}$，其中 $F_t^i$ 是第 $i$ 个视角的特征，$N_{\text{view}}$ 是摄像头视角的总数。同时，我们保留上一时间戳 $t-1$ 的 BEV 特征 $B_{t-1}$。在每个编码器层中，我们首先使用 BEV 查询 $Q$，通过时间自注意力从先前的 BEV 特征 $B_{t-1}$ 中查询时间信息。随后，我们再利用 BEV 查询 $Q$，通过空间交叉注意力从多摄像头特征 $F_t$ 中查询空间信息。经过前馈网络 [42] 后，编码器层输出细化后的 BEV 特征，作为下一层编码器的输入。经过 6 层堆叠的编码器层后，便得到当前时间戳 $t$ 下的统一 BEV 特征 $B_t$。以 BEV 特征 $B_t$ 为输入，3D 检测头和地图分割头预测 3D 边界框、语义地图等感知结果。

3.2 BEV Queries

3.2 BEV 查询

We predefine a group of grid-shaped learnable parameters Q∈ℝH×W×C as the queries of BEVFormer, where H,W are the spatial shape of the BEV plane. To be specific, the query Qp∈ℝ1×C located at p=(x,y) of Q is responsible for the corresponding grid cell region in the BEV plane. Each grid cell in the BEV plane corresponds to a real-world size of s meters. The center of BEV features corresponds to the position of the ego car by default. Following common practices [14], we add learnable positional embedding to BEV queries Q before inputting them to BEVFormer.
我们预先定义了一组网格状的可学习参数 Q∈ℝH×W×C ，作为 BEVFormer 的查询输入，其中 H,W 代表了 BEV 平面的空间结构。具体而言，位于 Q 的 p=(x,y) 处的查询 Qp∈ℝ1×C 负责处理 BEV 平面中对应的网格单元区域。BEV 平面上的每个网格单元对应现实世界中的 s 米大小。默认情况下，BEV 特征的中心位置对应于本车位置。遵循常见做法 [14]，我们在将查询 Q 输入 BEVFormer 之前，会为其添加可学习的位置嵌入信息。

3.3 Spatial Cross-Attention

3.3 空间交叉注意力

Due to the large input scale of multi-camera 3D perception (containing Nview camera views), the computational cost of vanilla multi-head attention [42] is extremely high. Therefore, we develop the spatial cross-attention based on deformable attention [56], which is a resource-efficient attention layer where each BEV query Qp only interacts with its regions of interest across camera views. However, deformable attention is originally designed for 2D perception, so some adjustments are required for 3D scenes.
由于多摄像头 3D 感知的输入规模巨大（包含 Nview 个摄像头视角），传统多头注意力机制的计算成本极高 [42]。因此，我们提出了基于可变形注意力的空间交叉注意力机制 [56]。这是一种资源效率较高的注意力层：每个 BEV 查询 Qp 仅与跨摄像头视角的感兴趣区域进行交互。不过，可变形注意力最初是为 2D 感知设计的，因此需要针对 3D 场景进行一些调整。

As shown in Fig. 2 (b), we first lift each query on the BEV plane to a pillar-like query [20], sample Nref 3D reference points from the pillar, and then project these points to 2D views. For one BEV query, the projected 2D points can only fall on some views, and other views are not hit. Here, we term the hit views as 𝒱hit. After that, we regard these 2D points as the reference points of the query Qp and sample the features from the hit views 𝒱hit around these reference points. Finally, we perform a weighted sum of the sampled features as the output of spatial cross-attention. The process of spatial cross-attention (SCA) can be formulated as:
如图 2（b）所示，我们首先将每个查询在 BEV 平面上转换为类似“柱状”的查询 [20]，从这些“柱状”结构中采样 Nref 个 3D 参考点，再将这些点投影到 2D 视图上。对于某个 BEV 查询，其投影后的 2D 点只会出现在某些视图上，而其他视图则不会被覆盖。我们将被覆盖的视图称为 𝒱hit 。之后，我们将这些 2D 点视为查询 Qp 的参考点，并从这些被覆盖的视图 𝒱hit 中采样特征。最后，通过对采样特征进行加权求和，得到空间交叉注意力的输出结果。空间交叉注意力（SCA）的运算过程可表述为：

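$$\mathrm{SCA}(Q_p, F_t) = \frac{1}{|\mathcal{V}_{\text{hit}}|} \sum_{i \in \mathcal{V}_{\text{hit}}} \sum_{j=1}^{N_{\text{ref}}} \mathrm{DeformAttn}\big(Q_p, \mathcal{P}(p, i, j), F_t^i\big), \tag{2}$$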

where i indexes the camera view, j indexes the reference points, and Nref is the total reference points for each BEV query. Fti is the features of the i-th camera view. For each BEV query Qp, we use a project function 𝒫(p,i,j) to get the j-th reference point on the i-th view image.
其中， i 用于索引摄像机视角， j 用于索引参考点， Nref 则是每次 BEV 查询对应的参考点总数。 Fti 表示第 i 个摄像机视角的特征信息。对于每次 BEV 查询 Qp ，我们使用投影函数 𝒫(p,i,j) 来确定第 j 个参考点在 i 视图图像上的位置。

Next, we introduce how to obtain the reference points on the view image from the projection function 𝒫. We first calculate the real world location (x′,y′) corresponding to the query Qp located at p=(x,y) of Q as Eqn. 3.
接下来，我们介绍如何根据投影函数 𝒫 获取视图图像上的参考点。首先，根据方程式 3，计算位于 Q 的 p=(x,y) 处的查询点 Qp 所对应的真实世界位置 (x′,y′) 。

$$x' = \left(x - \frac{W}{2}\right) \times s; \quad y' = \left(y - \frac{H}{2}\right) \times s, \tag{3}$$

where H, W are the spatial shape of BEV queries, s is the size of resolution of BEV’s grids, and (x′,y′) are the coordinates where the position of ego car is the origin. In 3D space, the objects located at (x′,y′) will appear at the height of z′ on the z-axis. So we predefine a set of anchor heights {zj′}j=1Nref to make sure we can capture clues that appeared at different heights. In this way, for each query Qp, we obtain a pillar of 3D reference points (x′,y′,zj′)j=1Nref. Finally, we project the 3D reference points to different image views through the projection matrix of cameras, which can be written as:
其中， H 和 W 表示 BEV 查询的空间形状， s 是 BEV 网格的分辨率大小， (x′,y′) 则是以本车位置为原点的坐标系。在三维空间中，位于 (x′,y′) 处的物体将在 z 轴上 z′ 的高度处显示。因此，我们预先定义了一组锚点高度 {zj′}j=1Nref ，以确保能够捕捉到出现在不同高度的线索。通过这种方式，对于每个查询 Qp ，我们都能获得一组三维参考点 (x′,y′,zj′)j=1Nref 。最后，我们利用摄像机的投影矩阵将这些三维参考点投影到不同的图像视图上，其公式可表示为：

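$$\mathcal{P}(p, i, j) = (x_{ij}, y_{ij}), \quad \text{where} \quad z_{ij} \cdot \begin{bmatrix} x_{ij} & y_{ij} & 1 \end{bmatrix}^{T} = T_i \cdot \begin{bmatrix} x' & y' & z'_j & 1 \end{bmatrix}^{T}. \tag{4}$$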

Here, 𝒫(p,i,j) is the 2D point on i-th view projected from j-th 3D point (x′,y′,zj′), Ti∈ℝ3×4 is the known projection matrix of the i-th camera.
这里， 𝒫(p,i,j) 是从第 j 个 3D 点 (x′,y′,zj′) 投影到第 i 个视图上的 2D 点， Ti∈ℝ3×4 则是第 i 个相机的已知投影矩阵。
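To make Eqs. (3) and (4) concrete, here is a minimal plain-Python sketch of how one BEV query could be lifted to a pillar and projected into each camera. The "hit" test used here (positive depth and a pixel inside the image bounds) is an assumption, since the text only says that projected points fall on some views. The function names and the toy projection matrix are hypothetical and do not correspond to a real camera calibration.

```python
def bev_to_world(x, y, H, W, s):
    """Eq. (3): BEV grid index (x, y) -> ego-centred metric coordinates (x', y')."""
    return (x - W / 2) * s, (y - H / 2) * s


def project(T, point):
    """Eq. (4): apply a 3x4 projection matrix T to (x', y', z'_j).

    Returns (u, v, depth), or None when the point lies behind the camera.
    """
    hom = (point[0], point[1], point[2], 1.0)
    a, b, depth = (sum(t * h for t, h in zip(row, hom)) for row in T)
    if depth <= 0:
        return None
    return a / depth, b / depth, depth


def pillar_reference_points(x, y, H, W, s, heights, cameras, image_size):
    """Return {camera index: [(u, v), ...]} for the views hit by one BEV query."""
    xw, yw = bev_to_world(x, y, H, W, s)
    width, height = image_size
    hits = {}
    for i, T in enumerate(cameras):
        for z in heights:
            projected = project(T, (xw, yw, z))
            if projected is None:
                continue
            u, v, _ = projected
            if 0 <= u < width and 0 <= v < height:  # assumed hit criterion
                hits.setdefault(i, []).append((u, v))
    return hits


# Toy matrix (hypothetical): focal length 100 px, principal point (50, 40),
# with the third input coordinate acting as depth.
toy_T = [[100.0, 0.0, 50.0, 0.0],
         [0.0, 100.0, 40.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
toy_hits = pillar_reference_points(100, 100, 200, 200, 0.512,
                                   [-1.0, 2.0], [toy_T], (100, 80))
```

The keys of the returned dictionary play the role of $\mathcal{V}_{\text{hit}}$ in Eq. (2): only those views are sampled, and the result is averaged over their count.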

3.4 Temporal Self-Attention

3.4 时序自注意力

In addition to spatial information, temporal information is also crucial for the visual system to understand the surrounding environment [27]. For example, it is challenging to infer the velocity of moving objects or detect highly occluded objects from static images without temporal clues. To address this problem, we design temporal self-attention, which can represent the current environment by incorporating history BEV features.
除了空间信息外，时间信息对于视觉系统理解周围环境也至关重要 [27]。例如，如果没有时间线索，仅从静态图像中推断运动物体的速度或检测被严重遮挡的物体是非常困难的。为了解决这个问题，我们设计了时间自注意力机制，该机制通过结合历史 BEV 特征来表征当前环境。

Given the BEV queries Q at current timestamp t and history BEV features Bt−1 preserved at timestamp t−1, we first align Bt−1 to Q according to ego-motion to make the features at the same grid correspond to the same real-world location. Here, we denote the aligned history BEV features Bt−1 as Bt−1′. However, from times t−1 to t, movable objects travel in the real world with various offsets. It is challenging to construct the precise association of the same objects between the BEV features of different times. Therefore, we model this temporal connection between features through the temporal self-attention (TSA) layer, which can be written as follows:
考虑到当前时间戳 t 下的 BEV 查询 Q ，以及时间戳 t−1 时保存的历史 BEV 特征 Bt−1 ，我们首先根据车辆自身的运动情况，将 Bt−1 与 Q 对齐，使得同一网格上的特征对应于现实世界中的同一位置。这里，我们将对齐后的历史 BEV 特征 Bt−1 记为 Bt−1′ 。然而，从时间点 t−1 到 t ，现实世界中的可移动物体会存在各种位移。因此，要建立不同时刻 BEV 特征之间同一物体的精确对应关系颇具挑战性。为此，我们通过时间自注意力（TSA）层来建模特征之间的这种时间关联，其表达式如下：

$$\mathrm{TSA}(Q_p, \{Q, B'_{t-1}\}) = \sum_{V \in \{Q, B'_{t-1}\}} \mathrm{DeformAttn}(Q_p, p, V), \tag{5}$$

where Qp denotes the BEV query located at p=(x,y). In addition, different from the vanilla deformable attention, the offsets Δp in temporal self-attention are predicted by the concatenation of Q and Bt−1′. Specially, for the first sample of each sequence, the temporal self-attention will degenerate into a self-attention without temporal information, where we replace the BEV features {Q,Bt−1′} with duplicate BEV queries {Q,Q}.
其中， Qp 表示位于 p=(x,y) 的 BEV 查询。此外，与普通的可变形注意力机制不同，时间自注意力中的偏移量 Δp 是通过 Q 和 Bt−1′ 的拼接来预测的。特别地，对于每个序列的第一个样本，时间自注意力会退化为不包含时间信息的自注意力；此时，我们会用重复的 BEV 查询 {Q,Q} 来替代 BEV 特征 {Q,Bt−1′} 。

Compared to simply stacking BEV in [18, 38, 6], our temporal self-attention can more effectively model long temporal dependency. BEVFormer extracts temporal information from the previous BEV features rather than multiple stacking BEV features, thus requiring less computational cost and suffering less disturbing information.
与简单地将 BEV 特征进行堆叠相比 [18, 38, 6]，我们的时间自注意力机制能更有效地捕捉长时间依赖关系。BEVFormer 从之前的 BEV 特征中提取时间信息，而非叠加多个 BEV 特征，因此计算成本更低，也较少受到干扰性信息的影响。
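The structure of Eq. (5), including the first-frame fallback to $\{Q, Q\}$, can be sketched as follows. This is only an outline under stated assumptions: the attention operator is passed in as a callable, and `nearest_lookup` is a hypothetical stand-in that simply reads the feature at the reference point instead of predicting offsets from the concatenation of $Q$ and $B'_{t-1}$.

```python
def temporal_self_attention(p, Q, B_prev_aligned, deform_attn):
    """Eq. (5): sum attention outputs over the current queries and the aligned history.

    Q and B_prev_aligned are H x W grids of channel vectors indexed as [y][x].
    When B_prev_aligned is None (first sample of a sequence), Q is used twice.
    """
    x, y = p
    q_p = Q[y][x]
    history = Q if B_prev_aligned is None else B_prev_aligned
    out = [0.0] * len(q_p)
    for V in (Q, history):
        attended = deform_attn(q_p, p, V)
        for c in range(len(out)):
            out[c] += attended[c]
    return out


def nearest_lookup(q_p, p, V):
    """Hypothetical stand-in for DeformAttn: return the feature at p with no offsets."""
    x, y = p
    return list(V[y][x])


toy_Q = [[[1.0, 2.0]]]
toy_B = [[[10.0, 20.0]]]
with_history = temporal_self_attention((0, 0), toy_Q, toy_B, nearest_lookup)
first_frame = temporal_self_attention((0, 0), toy_Q, None, nearest_lookup)
```

Because only one history tensor is consumed per step, the cost does not grow with the number of past frames, which is the contrast with stacking that the paragraph above draws.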

3.5 Applications of BEV Features

3.5 BEV特征的应用

Since the BEV features Bt∈ℝH×W×C is a versatile 2D feature map that can be used for various autonomous driving perception tasks, the 3D object detection and map segmentation task heads can be developed based on 2D perception methods [56, 22] with minor modifications.
由于 BEV 所使用的 Bt∈ℝH×W×C 是一种功能多样的 2D 特征图，可用于各种自动驾驶感知任务，因此基于 2D 感知方法 [56, 22] 稍作修改，即可开发出用于 3D 物体检测和地图分割的任务模块。

For 3D object detection, we design an end-to-end 3D detection head based on the 2D detector Deformable DETR [56]. The modifications include using single-scale BEV features Bt as the input of the decoder, predicting 3D bounding boxes and velocity rather than 2D bounding boxes, and only using L1 loss to supervise 3D bounding box regression. With the detection head, our model can end-to-end predict 3D bounding boxes and velocity without the NMS post-processing.
在 3D 物体检测方面，我们基于 2D 检测器 Deformable DETR [56] 设计了端到端的 3D检测模块。具体改进包括：使用单尺度 BEV 特征作为解码器的输入；预测3D边界框和速度而非 2D 边界框；仅使用 L1 损失函数来指导 3D 边界框的回归过程。借助该检测模块，我们的模型无需进行 NMS 后处理即可端到端地预测 3D 边界框和速度。

For map segmentation, we design a map segmentation head based on a 2D segmentation method Panoptic SegFormer [22]. Since the map segmentation based on the BEV is basically the same as the common semantic segmentation, we utilize the mask decoder of [22] and class-fixed queries to target each semantic category, including the car, vehicles, road (drivable area), and lane.
在地图分割方面，我们基于 2D 分割算法 Panoptic SegFormer 设计了地图分割模块 [22]。由于基于 BEV 的地图分割本质上与常规的语义分割相同，我们采用了 [22] 中的掩码解码器，并通过固定类别的查询来识别各类语义对象，包括汽车、其他车辆、道路（可行驶区域）和车道。

3.6 Implementation Details

3.6 实施细节

Training Phase. For each sample at timestamp t, we randomly sample another 3 samples from the consecutive sequence of the past 2 seconds, and this random sampling strategy can augment the diversity of ego-motion [57]. We denote the timestamps of these four samples as t−3, t−2, t−1 and t. The samples of the first three timestamps are responsible for recurrently generating the BEV features {Bt−3,Bt−2,Bt−1}, and this phase requires no gradients. For the first sample at timestamp t−3, there are no previous BEV features, and temporal self-attention degenerates into self-attention. At the time t, the model generates the BEV features Bt based on both multi-camera inputs and the prior BEV features Bt−1, so that Bt contains the temporal and spatial clues crossing the four samples. Finally, we feed the BEV features Bt into the detection and segmentation heads and compute the corresponding loss functions.
训练阶段。对于时间戳为 t 的每个样本，我们从过去 2 秒内的连续序列中随机抽取另外 3 个样本。这种随机采样策略有助于提升自车运动的多样性 [57]。我们将这 4 个样本的时间戳分别标记为 t−3、t−2、t−1 和 t。前三个时间戳对应的样本负责循环生成 BEV 特征 {Bt−3,Bt−2,Bt−1} ，此过程不需要梯度计算。对于时间戳为 t−3 的第一个样本，由于没有之前的 BEV 特征，时间自注意力会退化为普通自注意力。在时间戳 t 时，模型结合多摄像头输入和先前的 BEV 特征 Bt−1 来生成 BEV 特征 Bt ，从而使 Bt 包含跨越这 4 个样本的时空线索。最后，我们将 BEV 特征 Bt 输入到检测头和分割头中，并计算相应的损失函数。
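The training procedure above can be outlined roughly as below. The sketch assumes that "the past 2 seconds" corresponds to 4 candidate frames (keyframes at 2 Hz, as stated for nuScenes); that window size, the function names, and the `encoder` callable are hypothetical. Gradient handling is represented only by a flag, since a real framework would run the history steps with gradient tracking disabled.

```python
import random


def sample_training_clip(t, history_len=4, num_history=3, rng=random):
    """Pick num_history frame indices from the history_len frames before t, then append t."""
    candidates = list(range(max(0, t - history_len), t))
    picked = sorted(rng.sample(candidates, min(num_history, len(candidates))))
    return picked + [t]


def run_clip(frame_indices, encoder):
    """Roll the BEV state through the clip in chronological order.

    encoder(frame_index, prev_bev) returns the BEV features for that frame;
    prev_bev is None for the first frame. Only the last step is marked as
    needing gradients.
    """
    bev = None
    steps = []
    for k, idx in enumerate(frame_indices):
        needs_grad = k == len(frame_indices) - 1
        bev = encoder(idx, bev)
        steps.append((idx, needs_grad))
    return bev, steps
```

A call such as `run_clip(sample_training_clip(10), encoder)` would then produce the BEV features fed to the task heads, with the three history steps acting purely as context.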

Inference Phase. During the inference phase, we evaluate each frame of the video sequence in chronological order. The BEV features of the previous timestamp are saved and used for the next, and this online inference strategy is time-efficient and consistent with practical applications. Although we utilize temporal information, our inference speed is still comparable with other methods [45, 47].
推理阶段。在推理阶段，我们按时间顺序评估视频序列中的每一帧。前一时间戳对应的 BEV 特征会被保存下来并用于后续帧的处理。这种在线推理方式不仅效率很高，而且符合实际应用需求。尽管我们利用了时间信息，但我们的推理速度仍与其他方法相当 [45, 47]。
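The online inference strategy amounts to carrying a single cached BEV tensor from frame to frame and resetting it at the start of each sequence. A minimal sketch, where `encoder` and `heads` are hypothetical callables standing in for the BEV encoder and the task heads:

```python
def online_inference(sequences, encoder, heads):
    """Process each sequence frame by frame, reusing the previous BEV features.

    encoder(frame, prev_bev) returns BEV features; prev_bev is None for the
    first frame of a sequence. heads(bev) returns the predictions for that frame.
    """
    results = []
    for frames in sequences:
        prev_bev = None  # no history at the start of a sequence
        for frame in frames:
            bev = encoder(frame, prev_bev)
            results.append(heads(bev))
            prev_bev = bev  # saved for the next timestamp
    return results
```

Only one previous BEV state is stored at any time, so memory use does not depend on sequence length.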

4 Experiments

4 实验

4.1 Datasets

4.1 数据集

We conduct experiments on two challenging public autonomous driving datasets, namely nuScenes dataset [4] and Waymo open dataset [40].
我们在两个具有挑战性的公共自动驾驶数据集上进行了实验，分别是 nuScenes 数据集 [4] 和 Waymo 开放数据集 [40]。

The nuScenes dataset [4] contains 1000 scenes of roughly 20s duration each, and the key samples are annotated at 2Hz. Each sample consists of RGB images from 6 cameras and has 360° horizontal FOV. For the detection task, there are 1.4M annotated 3D bounding boxes from 10 categories. We follow the settings in [32] to perform BEV segmentation task. This dataset also provides the official evaluation metrics for the detection task. The mean average precision (mAP) of nuScenes is computed using the center distance on the ground plane rather than the 3D Intersection over Union (IoU) to match the predicted results and ground truth. The nuScenes metrics also contain 5 types of true positive metrics (TP metrics), including ATE, ASE, AOE, AVE, and AAE for measuring translation, scale, orientation, velocity, and attribute errors, respectively. The nuScenes also defines a nuScenes detection score (NDS) as $\mathrm{NDS} = \frac{1}{10}\left[5\,\mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \left(1 - \min(1, \mathrm{mTP})\right)\right]$ to capture all aspects of the nuScenes detection tasks.
nuScenes 数据集 [4] 包含 1000 个场景，每个场景时长约 20 秒，关键样本的标注频率为 2Hz。每个样本包含来自 6 台摄像机的 RGB 图像，具有 360° 的水平视场角。在检测任务中，该数据集包含 10 个类别的 140 万个标注过的 3D 边界框。我们按照 [32] 中的设置来执行 BEV 分割任务。该数据集还提供了检测任务的官方评估指标。nuScenes 的平均精度均值（mAP）在匹配预测结果与真实值时，使用地面平面上的中心距离，而非 3D 交并比（IoU）。nuScenes 的评估指标还包括 5 种真阳性指标（TP 指标），即 ATE、ASE、AOE、AVE 和 AAE，分别用于衡量平移、尺度、方向、速度和属性误差。nuScenes 还定义了 nuScenes 检测得分（NDS）：$\mathrm{NDS} = \frac{1}{10}\left[5\,\mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \left(1 - \min(1, \mathrm{mTP})\right)\right]$，以涵盖 nuScenes 检测任务的各个方面。
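The NDS formula is simple enough to check by hand. The small function below is a direct transcription of it; the function name and the input values in the example are made up for illustration and are not results from the paper.

```python
def nds(mAP, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors.

    tp_errors holds the ATE, ASE, AOE, AVE and AAE values; each error is
    clipped at 1 and converted to a score of (1 - error).
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five TP error values")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mAP + sum(tp_scores)) / 10.0


# Illustrative inputs only: mAP of 0.5 and every TP error equal to 0.5.
example_nds = nds(0.5, [0.5, 0.5, 0.5, 0.5, 0.5])
```

With these weights, mAP contributes half of the score and each of the five TP metrics contributes one tenth, so any single error larger than 1 simply zeroes its own share.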

Waymo Open Dataset [40] is a large-scale autonomous driving dataset with 798 training sequences and 202 validation sequences. Note that the five images at each frame provided by Waymo have only about 252° horizontal FOV, but the provided annotated labels cover 360° around the ego car. We remove the bounding boxes that are not visible in any image in the training and validation sets. Because the Waymo Open Dataset is large-scale and high-rate [34], we use a subset of the training split by sampling every 5th frame from the training sequences, and we only detect the vehicle category. We use the thresholds of 0.5 and 0.7 for 3D IoU to compute the mAP on the Waymo dataset.
Waymo Open Dataset [40] 是一个大规模的自动驾驶数据集，包含 798 个训练序列和 202 个验证序列。需要注意的是，Waymo 每帧提供的五张图像的水平视场角合计仅约 252°，但标注的标签涵盖了自车周围 360° 的范围。我们移除了训练集和验证集中在任何图像上都不可见的边界框。由于 Waymo Open Dataset 规模庞大且采样帧率高 [34]，我们从训练序列中每 5 帧抽取一帧，构成训练集的子集，并且仅检测车辆类别。在 Waymo 数据集上，我们以 0.5 和 0.7 作为 3D IoU 阈值来计算 mAP。

4.2 Experimental Settings

4.2 实验设置

Following previous methods [45, 47, 31], we adopt two types of backbone: ResNet101-DCN [15, 12] initialized from the FCOS3D [45] checkpoint, and VoVNet-99 [21] initialized from the DD3D [31] checkpoint. By default, we utilize the output multi-scale features from FPN [23] with sizes of 1/16, 1/32, 1/64 and a dimension of C=256. For experiments on nuScenes, the default size of BEV queries is 200×200, the perception ranges are [−51.2m, 51.2m] for the X and Y axes, and the resolution s of the BEV grid is 0.512m. We adopt learnable positional embedding for BEV queries. The BEV encoder contains 6 encoder layers and continually refines the BEV queries in each layer. The input BEV features $B_{t-1}$ for each encoder layer are the same and require no gradients. In the spatial cross-attention module, which is implemented with the deformable attention mechanism, each local query corresponds to $N_{ref}=4$ target points with different heights in 3D space, and the predefined height anchors are sampled uniformly from −5 meters to 3 meters. For each reference point on 2D view features, we use four sampling points around this reference point for each head. By default, we train our models for 24 epochs with a learning rate of 2×10−4.
沿用之前的方法 [45, 47, 31]，我们采用了两种类型的骨干网络：一种是从 FCOS3D [45] 检查点初始化的 ResNet101-DCN [15, 12]；另一种是从 DD3D [31] 检查点初始化的 VoVNet-99 [21]。默认情况下，我们使用 FPN [23] 输出的多尺度特征，其尺寸分别为 1/16、1/32、1/64，维度为 C=256。在 nuScenes 数据集上的实验中，BEV 查询的默认大小为 200×200，X 轴和 Y 轴方向上的感知范围均为 [−51.2 米, 51.2 米]，BEV 网格的分辨率 s 为 0.512 米。我们为 BEV 查询采用可学习的位置嵌入。BEV 编码器包含 6 个编码层，每一层都会对 BEV 查询进行持续优化。各编码层输入的 BEV 特征 $B_{t-1}$ 相同，且不需要计算梯度。在通过可变形注意力机制实现的空间交叉注意力模块中，每个局部查询对应 3D 空间中不同高度的 $N_{ref}=4$ 个目标点，预定义的高度锚点在 −5 米到 3 米之间均匀采样。对于 2D 视图特征上的每个参考点，每个注意力头在该参考点周围使用 4 个采样点。默认情况下，我们的模型训练 24 个 epoch，学习率为 2×10−4。
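
To make the nuScenes grid settings concrete, the sketch below lifts one BEV query cell to its pillar of 3D reference points. It assumes the cell-to-world mapping $(x - W/2) \cdot s$ from the spatial cross-attention section, and it assumes the height anchors are evenly spaced and include both endpoints; the official implementation may place them differently. All function names are hypothetical.

```python
def bev_cell_to_world(x, y, width=200, height=200, s=0.512):
    """Map a BEV grid index (x, y) to metric coordinates centred on the ego car."""
    return (x - width / 2) * s, (y - height / 2) * s


def height_anchors(z_min=-5.0, z_max=3.0, n_ref=4):
    """Evenly spaced heights in [z_min, z_max] (endpoint placement is an assumption)."""
    step = (z_max - z_min) / (n_ref - 1)
    return [z_min + i * step for i in range(n_ref)]


def pillar_points(x, y, width=200, height=200, s=0.512,
                  z_min=-5.0, z_max=3.0, n_ref=4):
    """Return the n_ref 3D reference points of the pillar above BEV cell (x, y)."""
    xw, yw = bev_cell_to_world(x, y, width, height, s)
    return [(xw, yw, z) for z in height_anchors(z_min, z_max, n_ref)]


if __name__ == "__main__":
    for point in pillar_points(0, 0):
        print(point)
```

A 200×200 grid at 0.512m per cell spans 102.4m per axis, which matches the [−51.2m, 51.2m] perception range stated in the text.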

For experiments on Waymo, we change a few settings. Because the camera system of Waymo cannot capture the whole scene around the ego car [40], the default spatial shape of BEV queries is 300×220, and the perception ranges are [−35.0m, 75.0m] for the X-axis and [−75.0m, 75.0m] for the Y-axis. The resolution s of each grid cell is 0.5m. The ego car is at (70, 150) of the BEV.
在 Waymo 上进行实验时，我们修改了少量设置。由于 Waymo 的摄像头系统无法捕捉到自车周围的完整场景 [40]，BEV 查询的默认空间形状为 300×220。在 X 轴上，感知范围为 [−35.0 米, 75.0 米]；在 Y 轴上，感知范围为 [−75.0 米, 75.0 米]。每个网格的分辨率 s 为 0.5 米。自车位于 BEV 的 (70, 150) 位置。
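
The Waymo grid shape and ego position follow directly from the perception ranges and the cell resolution. The short sketch below reproduces that arithmetic; `grid_layout` is a hypothetical helper, and treating the ego car as the metric origin is an assumption consistent with the stated ranges.

```python
def grid_layout(x_range, y_range, s):
    """Return (cells_x, cells_y, ego_index) for a BEV grid.

    x_range and y_range are (min, max) in metres with the ego car at the origin;
    s is the size of one grid cell in metres.
    """
    cells_x = round((x_range[1] - x_range[0]) / s)
    cells_y = round((y_range[1] - y_range[0]) / s)
    ego_index = (round(-x_range[0] / s), round(-y_range[0] / s))
    return cells_x, cells_y, ego_index


if __name__ == "__main__":
    # Waymo settings from the text: 300x220 grid, ego car at (70, 150)
    print(grid_layout((-35.0, 75.0), (-75.0, 75.0), 0.5))
    # nuScenes settings from the text: 200x200 grid
    print(grid_layout((-51.2, 51.2), (-51.2, 51.2), 0.512))
```

The asymmetric X range places the ego car 70 cells from the rear edge of the grid, so more of the BEV covers the area in front of the car, where the Waymo cameras actually look.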

Baselines. To eliminate the effect of task heads and compare other BEV generating methods fairly, we use VPN [30] and Lift-Splat [32] to replace our BEVFormer and keep task heads and other settings the same. We also adapt BEVFormer into a static model called BEVFormer-S via adjusting the temporal self-attention into a vanilla self-attention without using history BEV features.
基线方法。为消除任务头的影响并公平地比较其他 BEV 生成方法，我们使用 VPN [30] 和 Lift-Splat [32] 来替代我们的 BEVFormer，同时保持任务头及其他设置不变。我们还通过将时间自注意力调整为不使用历史 BEV 特征的普通自注意力，将 BEVFormer 改造成名为 BEVFormer-S 的静态模型。

Table 1: 3D detection results on nuScenes test set. ∗ denotes that VoVNet-99 (V2-99) [21] was pre-trained on the depth estimation task with extra data [31]. “BEVFormer-S” does not leverage temporal information in the BEV encoder. “L” and “C” indicate LiDAR and Camera, respectively.
表 1：nuScenes 测试集上的 3D 检测结果。∗ 表示 VoVNet-99 (V2-99) [21] 使用额外数据 [31] 在深度估计任务上进行了预训练。“BEVFormer-S”在 BEV 编码器中未利用时间信息。“L”和“C”分别代表激光雷达和摄像头。

Method 方法	Modality 模态	Backbone 主干	NDS↑	mAP↑	mATE↓	mASE↓	mAOE↓	mAVE↓	mAAE↓
SSN [55]	L	-	0.569	0.463	-	-	-	-	-
CenterPoint-Voxel [52]	L	-	0.655	0.580	-	-	-	-	-
PointPainting [43]	L&C	-	0.581	0.464	0.388	0.271	0.496	0.247	0.111
FCOS3D [45]	C	R101	0.428	0.358	0.690	0.249	0.452	1.434	0.124
PGD [44]	C	R101	0.448	0.386	0.626	0.245	0.451	1.509	0.127
BEVFormer-S	C	R101	0.462	0.409	0.650	0.261	0.439	0.925	0.147
BEVFormer	C	R101	0.535	0.445	0.631	0.257	0.405	0.435	0.143
DD3D [31]	C	V2-99∗	0.477	0.418	0.572	0.249	0.368	1.014	0.124
DETR3D [47]	C	V2-99∗	0.479	0.412	0.641	0.255	0.394	0.845	0.133
BEVFormer-S	C	V2-99∗	0.495	0.435	0.589	0.254	0.402	0.842	0.131
BEVFormer	C	V2-99∗	0.569	0.481	0.582	0.256	0.375	0.378	0.126

Table 2: 3D detection results on nuScenes val set. “C” indicates Camera.
表 2：在 nuScenes 验证集上的 3D 检测结果。 “C”表示摄像头。

Method 方法	Modality 模态	Backbone 主干	NDS↑	mAP↑	mATE↓	mASE↓	mAOE↓	mAVE↓	mAAE↓
FCOS3D [45]	C	R101	0.415	0.343	0.725	0.263	0.422	1.292	0.153
PGD [44]	C	R101	0.428	0.369	0.683	0.260	0.439	1.268	0.185
DETR3D [47]	C	R101	0.425	0.346	0.773	0.268	0.383	0.842	0.216
BEVFormer-S	C	R101	0.448	0.375	0.725	0.272	0.391	0.802	0.200
BEVFormer	C	R101	0.517	0.416	0.673	0.274	0.372	0.394	0.198

Table 3: 3D detection results on Waymo val set under Waymo evaluation metric and nuScenes evaluation metric. “L1” and “L2” refer to the “LEVEL_1” and “LEVEL_2” difficulties of Waymo [40]. *: Only use the front camera and only consider object labels in the front camera’s field of view (50.4°). †: We compute the NDS score by setting ATE and AAE to be 1. “L” and “C” indicate LiDAR and Camera, respectively.
表 3：在 Waymo 评估指标和 nuScenes 评估指标下，Waymo val 数据集上的 3D 检测结果。“L1”和“L2”分别对应 Waymo [40] 中的“LEVEL_1”和“LEVEL_2”难度级别。*: 仅使用前置摄像头，并仅考虑该摄像头视野范围（50.4°）内的物体标签。†：我们将 ATE 和 AAE 均设为 1 来计算 NDS 得分。“L”和“C”分别代表激光雷达和摄像头。

Method 方法	Modality 模态	Waymo Metrics Waymo 指标				nuScenes Metrics nuScenes 指标
		IoU=0.5		IoU=0.7
		L1/APH	L2/APH	L1/APH	L2/APH	NDS†↑	AP↑	ATE↓	ASE↓	AOE↓
PointPillars [20]	L	0.866	0.801	0.638	0.557	0.685	0.838	0.143	0.132	0.070
DETR3D [47]	C	0.220	0.216	0.055	0.051	0.394	0.388	0.741	0.156	0.108
BEVFormer	C	0.280	0.241	0.061	0.052	0.426	0.440	0.679	0.157	0.101
CaDNN∗ [34]	C	0.175	0.165	0.050	0.045	-	-	-	-	-
BEVFormer∗	C	0.308	0.277	0.077	0.069	-	-	-	-	-

Table 4: 3D detection and map segmentation results on nuScenes val set. Comparison of training segmentation and detection tasks jointly or not. *: We use VPN [30] and Lift-Splat [32] to replace our BEV encoder for comparison, and the task heads are the same. †: Results from their paper.
表 4：在 nuScenes 验证集上的 3D 检测与地图分割结果。比较了联合训练分割与检测任务与分别训练这两种任务的效果。*: 为便于比较，我们使用 VPN [30] 和 Lift-Splat [32] 来替代我们的 BEV 编码器，任务头保持一致。†：引自其原论文中的结果。

Method 方法	Task Head 任务头		3D Detection 3D 检测		BEV Segmentation (IoU) BEV 分割（IoU）
	Det	Seg	NDS↑	mAP↑	Car 汽车	Vehicles 车辆	Road 道路	Lane 车道
Lift-Splat† [32]	✗	✓	-	-	32.1	32.1	72.9	20.0
FIERY† [18]	✗	✓	-	-	-	38.2	-	-
VPN∗ [30]	✓	✗	0.333	0.253	-	-	-	-
VPN∗	✗	✓	-	-	31.0	31.8	76.9	19.4
VPN∗	✓	✓	0.334	0.257	36.6	37.3	76.0	18.0
Lift-Splat∗	✓	✗	0.397	0.348	-	-	-	-
Lift-Splat∗	✗	✓	-	-	42.1	41.7	77.7	20.0
Lift-Splat∗	✓	✓	0.410	0.344	43.0	42.8	73.9	18.3
BEVFormer-S	✓	✗	0.448	0.375	-	-	-	-
BEVFormer-S	✗	✓	-	-	43.1	43.2	80.7	21.3
BEVFormer-S	✓	✓	0.453	0.380	44.3	44.4	77.6	19.8
BEVFormer	✓	✗	0.517	0.416	-	-	-	-
BEVFormer	✗	✓	-	-	44.8	44.8	80.1	25.7
BEVFormer	✓	✓	0.520	0.412	46.8	46.7	77.5	23.9

4.3 3D Object Detection Results

4.3 3D 物体检测结果

We train our model on the detection task with the detection head only, for a fair comparison with previous state-of-the-art 3D object detection methods. In Tab. 1 and Tab. 2, we report our main results on the nuScenes test and val splits. Our method outperforms the previous best method DETR3D [47] by 9.2 points on the val set (51.7% NDS vs. 42.5% NDS), under a fair training strategy and comparable model scales. On the test set, our model achieves 56.9% NDS without bells and whistles, 9.0 points higher than DETR3D (47.9% NDS). Our method can even achieve comparable performance to some LiDAR-based baselines such as SSN (56.9% NDS) [55] and PointPainting (58.1% NDS) [43].
我们仅使用检测头在检测任务上训练模型，以便与现有的最先进 3D 物体检测方法进行公平比较。在表 1 和表 2 中，我们报告了在 nuScenes 测试集和验证集上的主要结果。在公平的训练策略和相当的模型规模下，我们的方法在验证集上比此前的最佳方法 DETR3D [47] 高出 9.2 分（51.7% NDS 对比 42.5% NDS）。在测试集上，我们的模型在未使用任何额外技巧的情况下达到了 56.9% NDS，比 DETR3D（47.9% NDS）高出 9.0 分。我们的方法甚至能与一些基于 LiDAR 的基线方法相媲美，例如 SSN（56.9% NDS）[55] 和 PointPainting（58.1% NDS）[43]。

Previous camera-based methods [47, 31, 45] were almost unable to estimate the velocity, and our method demonstrates that temporal information plays a crucial role in velocity estimation for multi-camera detection. The mean Average Velocity Error (mAVE) of BEVFormer is 0.378 m/s on the test set, outperforming other camera-based methods by a vast margin and approaching the performance of LiDAR-based methods [43].
以往基于摄像头的方法 [47, 31, 45] 几乎无法估计速度。我们的方法表明，在多摄像头检测中，时间信息对速度估计起着至关重要的作用。在测试集上，BEVFormer 的平均速度误差均值(mAVE)为 0.378 米/秒，远远优于其他基于摄像头的方法，并接近基于 LiDAR 的方法 [43] 的性能。

We also conduct experiments on Waymo, as shown in Tab. 3. Following [34], we evaluate the vehicle category with IoU criteria of 0.7 and 0.5. In addition, we also adopt the nuScenes metrics to evaluate the results since the IoU-based metrics are too challenging for camera-based methods. Since few camera-based works have reported results on Waymo, we also use the official code of DETR3D to perform experiments on Waymo for comparison. We can observe that BEVFormer outperforms DETR3D in Average Precision with Heading information (APH) [40] by 6.0% and 2.5% on LEVEL_1 and LEVEL_2 difficulties with an IoU criterion of 0.5. On nuScenes metrics, BEVFormer outperforms DETR3D with a margin of 3.2% NDS and 5.2% AP. We also conduct experiments on the front camera to compare BEVFormer with CaDNN [34], a monocular 3D detection method that reported results on the Waymo dataset. BEVFormer outperforms CaDNN in APH by 13.3% and 11.2% on LEVEL_1 and LEVEL_2 difficulties with an IoU criterion of 0.5.
如表 3 所示，我们也在 Waymo 上进行了实验。遵循 [34] 中的方法，我们使用 0.7 和 0.5 的 IoU 标准来评估车辆类别。此外，由于基于 IoU 的评估标准对基于摄像头的方法来说难度过大，我们还采用了 nuScenes 指标来评估结果。鉴于很少有基于摄像头的研究在 Waymo 上报告结果，我们还使用 DETR3D 的官方代码在 Waymo 上进行实验以作对比。可以看到，在 IoU 标准为 0.5 的条件下，BEVFormer 在 LEVEL_1 和 LEVEL_2 难度下的带航向信息的平均精度(APH) [40] 分别比 DETR3D 高出 6.0% 和 2.5%。在 nuScenes 指标上，BEVFormer 的 NDS 和 AP 分别比 DETR3D 高出 3.2% 和 5.2%。我们还在前置摄像头上进行了实验，将 BEVFormer 与 CaDNN [34] 进行对比。CaDNN 是一种单目 3D 检测方法，曾在 Waymo 数据集上报告过结果。在 IoU 标准为 0.5 的条件下，BEVFormer 在 LEVEL_1 和 LEVEL_2 难度下的 APH 分别比 CaDNN 高出 13.3% 和 11.2%。

4.4 Multi-tasks Perception Results

4.4 多任务感知测试结果

We train our model with both detection and segmentation heads to verify the learning ability of our model for multiple tasks, and the results are shown in Tab. 4. When comparing different BEV encoders under the same settings, BEVFormer achieves higher performance on all tasks except road segmentation, where its results are comparable with BEVFormer-S. For example, with joint training, BEVFormer outperforms Lift-Splat∗ [32] by 11.0 points on the detection task (52.0% NDS vs. 41.0% NDS) and by 5.6 points of IoU on lane segmentation (23.9% vs. 18.3%). Compared with training tasks individually, multi-task learning saves computational cost and reduces the inference time by sharing more modules, including the backbone and the BEV encoder. In this paper, we show that the BEV features generated by our BEV encoder can be well adapted to different tasks, and the model trained with multi-task heads performs even better on detection and vehicle segmentation. However, the jointly trained model does not perform as well as individually trained models for road and lane segmentation, which is a common phenomenon called negative transfer [11, 13] in multi-task learning.
我们同时使用检测头和分割头训练模型，以此验证其在多种任务上的学习能力，相关结果见表 4。在相同设置下对比不同的 BEV 编码器时，除道路分割任务外，BEVFormer 在所有任务上的表现均更优；在道路分割任务上，其结果与 BEVFormer-S 相当。例如，在联合训练中，BEVFormer 在检测任务上比 Lift-Splat∗ [32] 高出 11.0 分（52.0% NDS 对比 41.0% NDS），在车道分割任务上的 IoU 高出 5.6 分（23.9% 对比 18.3%）。与分别训练各任务相比，多任务学习通过共享包括主干网络和 BEV 编码器在内的更多模块，节省了计算成本并缩短了推理时间。在本文中，我们证明了由 BEV 编码器生成的 BEV 特征能够很好地适应不同任务，并且使用多任务头训练的模型在检测和车辆分割任务上的表现更为出色。不过，对于道路和车道分割任务，联合训练的模型表现不如单独训练的模型。这种现象在多任务学习中很常见，被称为负迁移 [11, 13]。

Table 5: The detection results of different methods with various BEV encoders on nuScenes val set. “Memory” is the consumed GPU memory during training. *: We use VPN [30] and Lift-Splat [32] to replace BEV encoder of our model for comparison. †: We train BEVFormer-S using global attention in spatial cross-attention, and the model is trained with fp16 weights. In addition, we only adopt single-scale features from the backbone and set the spatial shape of BEV queries to be 100×100 to save memory. ‡: We degrade the interaction targets of deformable attention from the local region to the reference points only by removing the predicted offsets and weights.
表 5：在 nuScenes 验证集上，不同方法使用各种 BEV 编码器时的检测结果。“Memory”表示训练过程中消耗的 GPU 显存。*: 为便于对比，我们使用 VPN [30] 和 Lift-Splat [32] 来替代模型中的 BEV 编码器。†: 我们在空间交叉注意力中采用全局注意力来训练 BEVFormer-S，模型使用 fp16 权重进行训练。此外，我们仅使用主干网络输出的单尺度特征，并将 BEV 查询的空间形状设置为 100×100，以节省显存。‡：我们移除了预测得到的偏移量和权重，将可变形注意力的交互目标从局部区域退化为仅参考点本身。

Method 方法	Attention 注意力	NDS↑	mAP↑	mATE↓	mAOE↓	#Param.	FLOPs	Memory 显存
VPN∗ [30]	-	0.334	0.252	0.926	0.598	111.2M	924.5G	∼20G
Lift-Splat∗ [32]	-	0.397	0.348	0.784	0.537	74.0M	1087.7G	∼20G
BEVFormer-S†	Global 全局	0.404	0.325	0.837	0.442	62.1M	1245.1G	∼36G
BEVFormer-S‡	Points 点	0.423	0.351	0.753	0.442	68.1M	1264.3G	∼20G
BEVFormer-S	Local 局部	0.448	0.375	0.725	0.391	68.7M	1303.5G	∼20G

Figure 3: The detection results of subsets with different visibilities. We divide the nuScenes val set into four subsets based on the visibility that {0-40%, 40-60%, 60-80%, 80-100%} of objects can be visible. (a): Enhanced by the temporal information, BEVFormer has a higher recall on all subsets, especially on the subset with the lowest visibility (0-40%). (b), (d) and (e): Temporal information benefits translation, orientation, and velocity accuracy. (c) and (f): The scale and attribute error gaps among different methods are minimal. Temporal information does not work to benefit an object’s scale prediction.
图 3：不同可见度下各子集的检测结果。我们将 nuScenes 验证集根据物体的可见度分为四个子集：{0-40%、40-60%、60-80%、80-100%}。(a)：借助时间信息，BEVFormer 在所有子集上的召回率均更高，尤其是在可见度最低的子集（0-40%）上。(b)、(d)和(e)：时间信息有助于提升平移、朝向和速度的预测精度。(c)和(f)：不同方法在尺度与属性误差上的差距很小。时间信息对物体的尺度预测并无帮助。

4.5 Ablation Study

4.5 消融研究

To delve into the effect of different modules, we conduct ablation experiments on nuScenes val set with detection head. More ablation studies are in Appendix.
为探究不同模块的效应，我们在带有检测头的 nuScenes val 数据集上进行了消融实验。更多消融研究内容见附录。

Effectiveness of Spatial Cross-Attention. To verify the effect of spatial cross-attention, we use BEVFormer-S to perform ablation experiments to exclude the interference of temporal information, and the results are shown in Tab. 5. The default spatial cross-attention is based on deformable attention. For comparison, we also construct two other baselines with different attention mechanisms: (1) using global attention to replace deformable attention; (2) making each query interact only with its reference points rather than the surrounding local regions, which is similar to previous methods [36, 37]. For a broader comparison, we also replace the BEVFormer with the BEV generation methods proposed by VPN [30] and Lift-Splat [32]. We can observe that deformable attention significantly outperforms other attention mechanisms under a comparable model scale. Global attention consumes too much GPU memory, and point interaction has a limited receptive field. Sparse attention achieves better performance because it interacts with a priori determined regions of interest, balancing receptive field and GPU consumption.
空间交叉注意力的有效性。为验证空间交叉注意力机制的效果，我们使用 BEVFormer-S 进行了消融实验，以排除时间信息的干扰。实验结果如表 5 所示。默认的空间交叉注意力机制基于可变形注意力。作为对比，我们还构建了两种采用不同注意力机制的基线模型：(1)用全局注意力替代可变形注意力；(2)使每个查询仅与其参考点交互，而不与周围的局部区域交互，该方式与先前的研究 [36, 37] 类似。为了进行更全面的比较，我们还将 BEVFormer 替换为 VPN [30] 和 Lift-Splat [32] 提出的 BEV 生成方法。实验结果表明，在模型规模相当的情况下，可变形注意力的性能显著优于其他注意力机制。全局注意力会占用过多 GPU 显存，而点交互方式的感受野有限。稀疏注意力则表现更好，因为它与先验确定的感兴趣区域进行交互，从而在感受野和 GPU 资源消耗之间取得了平衡。
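
The difference between the "Local" (deformable) and "Points" variants in Tab. 5 can be illustrated on a toy single-channel feature map. This is a simplified sketch under stated assumptions, not the paper's implementation: the feature map holds scalars instead of C-dimensional vectors, offsets and weights are passed in rather than predicted from the query, out-of-range coordinates are clamped to the border, and all function names are hypothetical.

```python
import math


def bilinear(feat, x, y):
    """Bilinearly sample a 2D list `feat` at (x, y); coordinates are clamped."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bottom = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def sample_local(feat, ref, offsets, weights):
    """'Local' variant: weighted sum over sampling points around the reference point."""
    return sum(w * bilinear(feat, ref[0] + dx, ref[1] + dy)
               for (dx, dy), w in zip(offsets, weights))


def sample_point(feat, ref):
    """'Points' variant: offsets and weights removed, only the reference point is read."""
    return bilinear(feat, ref[0], ref[1])


if __name__ == "__main__":
    feat = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # four sampling points
    weights = [0.7, 0.1, 0.1, 0.1]                 # illustrative attention weights
    print(sample_point(feat, (1, 1)))
    print(sample_local(feat, (1, 1), offsets, weights))
```

Global attention would instead read every location of every view for every query, which is consistent with the much larger memory footprint reported for that variant.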

Effectiveness of Temporal Self-Attention. From Tab. 1 and Tab. 4, we can observe that BEVFormer outperforms BEVFormer-S with remarkable improvements under the same setting, especially on challenging detection tasks. The effect of temporal information is mainly in the following aspects: (1) The introduction of temporal information greatly benefits the accuracy of the velocity estimation; (2) The predicted locations and orientations of the objects are more accurate with temporal information; (3) We obtain higher recall on heavily occluded objects since the temporal information contains clues about past objects, as shown in Fig. 3. To evaluate the performance of BEVFormer on objects with different occlusion levels, we divide the validation set of nuScenes into four subsets according to the official visibility label provided by nuScenes. In each subset, we also compute the average recall of all categories with a center distance threshold of 2 meters during matching. The maximum number of predicted boxes is 300 for all methods to compare recall fairly. On the subset where only 0-40% of each object is visible, the average recall of BEVFormer outperforms BEVFormer-S and DETR3D with a margin of more than 6.0%.
时间自注意力机制的有效性。从表 1 和表 4 可以看出，在相同设置下，BEVFormer 的表现明显优于 BEVFormer-S，尤其是在具有挑战性的检测任务中。时间信息的作用主要体现在以下几个方面：(1)时间信息的引入显著提升了速度估计的准确性；(2)利用时间信息后，物体位置和朝向的预测更为精确；(3)如图 3 所示，由于时间信息包含了关于过去物体的线索，我们在被严重遮挡的物体上获得了更高的召回率。为了评估 BEVFormer 在不同遮挡程度物体上的表现，我们根据 nuScenes 提供的官方可见性标签，将验证集划分为四个子集。在每个子集中，我们在匹配过程中以 2 米的中心距离阈值来计算所有类别的平均召回率。为确保公平比较召回率，所有方法的预测框数量上限均为 300 个。在物体可见度仅为 0-40% 的子集上，BEVFormer 的平均召回率比 BEVFormer-S 和 DETR3D 高出 6.0% 以上。
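
The recall protocol described above (center-distance matching at 2 meters, at most 300 predictions) can be sketched for a single category as follows. The greedy, score-ordered matching is an assumption in the spirit of the nuScenes evaluation and is not taken from the paper; `recall_at_distance` and the input layout are hypothetical.

```python
import math


def recall_at_distance(preds, gts, dist_thresh=2.0, max_preds=300):
    """Recall for one category using ground-plane center distance.

    preds: list of (x, y, score); gts: list of (x, y).
    Predictions are visited by descending score; each ground-truth box
    can be matched at most once.
    """
    if not gts:
        return 0.0
    top = sorted(preds, key=lambda p: p[2], reverse=True)[:max_preds]
    matched = set()
    for px, py, _ in top:
        best_j, best_d = None, dist_thresh
        for j, (gx, gy) in enumerate(gts):
            if j in matched:
                continue
            d = math.hypot(px - gx, py - gy)
            if d < best_d:           # closest unmatched box within the threshold
                best_j, best_d = j, d
        if best_j is not None:
            matched.add(best_j)
    return len(matched) / len(gts)


if __name__ == "__main__":
    gts = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0)]
    preds = [(0.5, 0.0, 0.9), (10.0, 1.0, 0.8), (30.0, 30.0, 0.7)]
    print(recall_at_distance(preds, gts))
```

Averaging this value over categories within each visibility subset would give the kind of per-subset average recall discussed in the text.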

Model Scale and Latency. We compare the performance and latency of different configurations in Tab. 6. We ablate the scales of BEVFormer in three aspects, including whether to use multi-scale view features, the shape of BEV queries, and the number of layers, to verify the trade-off between performance and inference latency. We can observe that configuration C, using one encoder layer in BEVFormer, achieves 50.1% NDS and reduces the latency of BEVFormer from the original 130ms to 25ms. Configuration D, with single-scale view features, a smaller BEV size, and only 1 encoder layer, consumes only 7ms during inference, although it loses 3.9 points of NDS compared to the default configuration. However, due to the multi-view image inputs, the bottleneck that limits the efficiency lies in the backbone, and efficient backbones for autonomous driving deserve in-depth study. Overall, our architecture can adapt to various model scales and flexibly trade off performance and efficiency.
模型规模与延迟。我们在表 6 中比较了不同配置下的性能与延迟。我们从三个方面调整了 BEVFormer 的规模：是否使用多尺度视图特征、BEV 查询的形状以及编码层数，以此来验证性能与推理延迟之间的权衡关系。实验结果表明：仅使用一个编码层的配置 C 达到了 50.1% NDS，同时将 BEVFormer 的延迟从原来的 130 毫秒降低到了 25 毫秒。配置 D 使用单尺度视图特征、较小的 BEV 尺寸和仅 1 个编码层，其推理延迟仅为 7 毫秒，但 NDS 较默认配置下降了 3.9 分。不过，由于采用了多视图图像输入，限制效率的瓶颈在于主干网络；因此，面向自动驾驶的高效主干网络值得深入研究。总体而言，我们的架构能够适应不同的模型规模，并灵活地在性能与效率之间进行权衡。

Table 6: Latency and performance of different model configurations on nuScenes val set. The latency is measured on a V100 GPU, and the backbone is R101-DCN. The input image shape is 900×1600. “MS” denotes multi-scale view features.
表 6：不同模型配置在 nuScenes 验证集上的延迟与性能表现。延迟在 V100 GPU 上测得，主干网络为 R101-DCN。输入图像尺寸为 900×1600。“MS”表示多尺度视图特征。

Method 方法	Scale of BEVFormer BEVFormer 的规模			Latency (ms) 延迟（毫秒）			FPS	NDS↑	mAP↑
	MS	BEV	#Layer	Backbone 主干	BEVFormer	Head 检测头
BEVFormer	✓	200×200	6	391	130	19	1.7	0.517	0.416
A	✗	200×200	6	387	87	19	1.9	0.511	0.406
B	✓	100×100	6	391	53	18	2.0	0.504	0.402
C	✓	200×200	1	391	25	19	2.1	0.501	0.396
D	✗	100×100	1	387	7	18	2.3	0.478	0.374

4.6 Visualization Results

4.6 可视化结果

We show the detection results of a complex scene in Fig. 4. BEVFormer produces impressive results except for a few mistakes in small and remote objects. More qualitative results are provided in Appendix.
我们在图 4 中展示了复杂场景的检测结果。除在小型和远处物体上存在少量错误外，BEVFormer 取得了令人印象深刻的结果。更详细的定性分析结果见附录。

Figure 4: Visualization results of BEVFormer on nuScenes val set. We show the 3D bounding box predictions in multi-camera images and the bird’s-eye-view.
图 4：BEVFormer 在 nuScenes 验证集上的可视化结果。我们展示了多摄像头图像中的 3D 边界框预测结果以及鸟瞰图。

5 Discussion and Conclusion

5 讨论与结论

In this work, we have proposed BEVFormer to generate the bird’s-eye-view features from multi-camera inputs. BEVFormer can efficiently aggregate spatial and temporal information and generate powerful BEV features that simultaneously support 3D detection and map segmentation tasks.
在这项工作中，我们提出了 BEVFormer，用于从多摄像头输入中生成鸟瞰图特征。BEVFormer 能够高效地整合空间和时间信息，生成出色的鸟瞰图特征，这些特征同时适用于 3D 检测和地图分割任务。

Limitations. At present, camera-based methods still have a considerable gap with LiDAR-based methods in effectiveness and efficiency. Accurate inference of 3D location from 2D information remains a long-standing challenge for camera-based methods.
局限性。目前，基于摄像头的算法在效果和效率方面仍与基于 LiDAR 的算法存在明显差距。如何从二维信息准确推断出三维位置，依然是基于摄像头的算法面临的长期挑战。

Broader impacts. BEVFormer demonstrates that using spatiotemporal information from the multi-camera input can significantly improve the performance of visual perception models. The advantages demonstrated by BEVFormer, such as more accurate velocity estimation and higher recall on low-visibility objects, are essential for constructing a better and safer autonomous driving system and beyond. We believe BEVFormer is just a baseline for subsequent, more powerful visual perception methods, and vision-based perception systems still have tremendous potential to be explored.
更广泛的影响。BEVFormer 表明，利用多摄像头输入中的时空信息能够显著提升视觉感知模型的性能。BEVFormer 所展现的优势，如更精确的速度估计以及在低可见度物体上更高的召回率，对于构建更出色、更安全的自动驾驶系统乃至更广泛的应用至关重要。我们认为，BEVFormer 只是后续更强大视觉感知方法的一个基线，基于视觉的感知系统仍有巨大的潜力有待探索。

声明：本文主要取材于ar5iv.labs，本人对全文中文翻译部分进行了细致全面的校验和编辑工作。