Abstract:
The standard Transformer model is inadequate for detecting objects of different sizes within the same scene. The main reason is that the fixed-scale input embedding at each layer cannot extract cross-scale features, so the network is unable to establish interactions between features of different scales. In this paper, we propose a Transformer-based multi-scale object detection network that uses a cross-scale embedding layer for the initial embedding of image features: the input is downsampled by multi-branch dilated (atrous) convolutions, and adjusting the dilation rates of the parallel branches gives the structure diverse receptive fields. The resulting embeddings are then processed by a residual self-attention module that links local and global information in the feature map, so that the attention computation incorporates effective multi-scale semantic information and ultimately achieves multi-scale object detection. The model is trained on datasets such as COCO, and the experimental results show that the method has significant advantages over other object detection methods.
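To make the cross-scale embedding idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: parallel 3x3 convolutions with different dilation rates (and matching padding, so all branches keep the same spatial size) downsample the image, and their concatenated outputs are flattened into tokens for the Transformer. The class name, channel sizes, and dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Hypothetical cross-scale embedding layer: parallel dilated convolutions
    with different dilation rates give the branches diverse receptive fields;
    their concatenated outputs form the token embedding fed to the Transformer."""
    def __init__(self, in_ch=3, embed_dim=96, dilations=(1, 2, 3), stride=4):
        super().__init__()
        branch_dim = embed_dim // len(dilations)
        self.branches = nn.ModuleList([
            # padding=d keeps the output resolution identical across branches
            nn.Conv2d(in_ch, branch_dim, kernel_size=3, stride=stride,
                      padding=d, dilation=d)
            for d in dilations
        ])
        self.norm = nn.LayerNorm(branch_dim * len(dilations))

    def forward(self, x):                                          # x: (B, C, H, W)
        feats = torch.cat([b(x) for b in self.branches], dim=1)    # (B, D, H/s, W/s)
        tokens = feats.flatten(2).transpose(1, 2)                  # (B, N, D)
        return self.norm(tokens)

# Usage: embed a 224x224 image into multi-scale tokens.
tokens = CrossScaleEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```

The design choice illustrated here is that each branch sees the same input at a different effective receptive field, so a single embedding step already mixes fine and coarse context before any self-attention is applied.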