Abstract:
The standard Transformer model is inadequate for detecting objects of different sizes within the same scene. The main reason is that the fixed-scale input embedding at each layer cannot extract cross-scale features, so the network is unable to establish interactions between features of different scales. In this paper, we propose a Transformer-based multi-scale object detection network that uses a cross-scale embedding layer for the initial embedding of image features: the input is downsampled by multi-branch dilated (atrous) convolutions, and adjusting the dilation rates of the parallel branches gives the structure diverse receptive fields. The resulting embeddings are then processed by a residual self-attention module that links local and global information in the feature map, so that the attention computation incorporates effective multi-scale semantic information and ultimately achieves multi-scale object detection. The model is trained on datasets such as COCO, and the experimental results show that the method has significant advantages over other object detection methods.
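To make the cross-scale embedding idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: parallel 3x3 convolutions with different dilation rates (and matching padding, so all branches keep the same spatial size) downsample the image, and their concatenated outputs are flattened into tokens for the Transformer. The class name, channel sizes, and dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Hypothetical cross-scale embedding layer: parallel dilated convolutions
    with different dilation rates give the branches diverse receptive fields;
    their concatenated outputs form the token embedding fed to the Transformer."""
    def __init__(self, in_ch=3, embed_dim=96, dilations=(1, 2, 3), stride=4):
        super().__init__()
        branch_dim = embed_dim // len(dilations)
        self.branches = nn.ModuleList([
            # padding=d keeps the output resolution identical across branches
            nn.Conv2d(in_ch, branch_dim, kernel_size=3, stride=stride,
                      padding=d, dilation=d)
            for d in dilations
        ])
        self.norm = nn.LayerNorm(branch_dim * len(dilations))

    def forward(self, x):                                          # x: (B, C, H, W)
        feats = torch.cat([b(x) for b in self.branches], dim=1)    # (B, D, H/s, W/s)
        tokens = feats.flatten(2).transpose(1, 2)                  # (B, N, D)
        return self.norm(tokens)

# Usage: embed a 224x224 image into multi-scale tokens.
tokens = CrossScaleEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```

The design choice illustrated here is that each branch sees the same input at a different effective receptive field, so a single embedding step already mixes fine and coarse context before any self-attention is applied.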