Abstract:
Image classification is a fundamental task in image processing. Because a real-world image usually carries more than one label, single-label classification no longer meets practical needs, which has motivated multi-label image classification. This paper proposes a multi-label image classification framework that uses Swin Transformer for feature extraction and a dual-layer routing attention module for feature processing. Swin Transformer extracts multi-scale information through its hierarchical structure and is superior to Vision Transformer for multi-object and finer-grained image recognition. The dual-layer routing attention module enables more flexible, content-aware allocation of computation. The dynamic attention mechanism adaptively adjusts the attention weights according to the characteristics of the input image, so that different positions or features receive different levels of attention and the intensity and range of attention can be flexibly controlled. The model achieves an average precision of 87.3 on the COCO dataset and 96.7 on the VOC2007 dataset, which improves the accuracy of multi-label image classification to some extent.
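As a reading aid for the reported average precision figures, the following is a minimal sketch of how average precision is commonly computed for multi-label classification: a per-class average precision over ranked scores, averaged across classes. It assumes the non-interpolated definition; the abstract does not state the exact evaluation protocol (the VOC2007 benchmark, for example, traditionally uses an 11-point interpolated variant), and the function names `average_precision` and `mean_average_precision` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP for one class (assumed protocol, not the paper's).

    Images are ranked by descending score; precision is recorded at each
    rank where a positive image appears, and those precisions are averaged.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    score_matrix: Sequence[Sequence[float]],
    label_matrix: Sequence[Sequence[int]],
) -> float:
    """Mean of per-class AP; rows are images, columns are classes.

    Classes with no positive image are skipped (an illustrative choice).
    """
    num_classes = len(score_matrix[0])
    aps: List[float] = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        if any(class_labels):
            aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / len(aps) if aps else 0.0
```

Each row of `score_matrix` holds one image's per-class scores (for instance, sigmoid outputs), and the matching row of `label_matrix` holds its 0/1 ground-truth labels; multiplying the result by 100 gives a value on the same scale as the figures quoted in the abstract.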