Multi-Label Image Classification Algorithm Based on Swin Transformer and Bi-Level Routing Attention

张震; 王贺; 宋宏旭

Multi-Label Image Classification Algorithm Based on Transformer

Abstract

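Image classification is a fundamental and important task in image processing. Single-label image classification can no longer meet practical needs, and researchers have turned their attention to multi-label image classification. This paper proposes a multi-label image classification framework in which a Swin Transformer performs feature extraction and a bi-level routing attention module performs feature processing. The Swin Transformer extracts multi-scale information through its hierarchical structure and outperforms the Vision Transformer on multi-object and finer-grained image recognition. The bi-level routing attention module enables more flexible computation allocation and content awareness: it adaptively adjusts the attention weights according to the features of the input image and flexibly controls the strength and range of attention. The model achieves a mean average precision (mAP) of 87.3 on the COCO dataset and 96.7 on the VOC2007 dataset, improving the accuracy of multi-label image classification to a certain extent.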
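The two-stage idea behind a bi-level routing attention module can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a flattened 1-D token sequence, uses the tokens themselves as queries, keys, and values (no learned projections), and the names `bi_level_routing_attention`, `region_size`, and `top_k` are hypothetical. It shows only the coarse region-to-region routing step followed by fine token-to-token attention restricted to the routed regions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a list of vectors (region-level descriptor)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bi_level_routing_attention(tokens, region_size, top_k):
    """Sketch of two-level attention over a flat token list.

    Assumptions (not from the paper): consecutive `region_size` tokens form
    one region, and tokens act directly as queries, keys, and values.
    """
    regions = [tokens[i:i + region_size]
               for i in range(0, len(tokens), region_size)]
    region_means = [mean_vector(r) for r in regions]
    dim = len(tokens[0])
    outputs = []
    for r_idx, region in enumerate(regions):
        # Level 1: score every region against this one, keep the top_k.
        affinity = [dot(region_means[r_idx], m) for m in region_means]
        routed = sorted(range(len(regions)),
                        key=lambda j: affinity[j], reverse=True)[:top_k]
        # Gather keys/values only from the routed regions.
        kv = [t for j in routed for t in regions[j]]
        # Level 2: ordinary scaled dot-product attention over gathered tokens.
        for q in region:
            weights = softmax([dot(q, k) / math.sqrt(dim) for k in kv])
            outputs.append([sum(w * v[d] for w, v in zip(weights, kv))
                            for d in range(dim)])
    return outputs
```

With `top_k` smaller than the number of regions, each token attends only to tokens in the regions judged most relevant to its own region, which is how the attention range adapts to image content; setting `top_k` to the number of regions recovers full attention in this sketch.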

Abstract: Image classification is a basic and important task in image processing. Because an image often carries more than one label, single-label image classification can no longer meet practical needs, and multi-label image classification has emerged in response. This paper proposes a multi-label image classification framework that uses Swin Transformer for feature extraction and a bi-level routing attention module for feature processing. Swin Transformer extracts multi-scale information through a hierarchical structure and is superior to Vision Transformer in multi-target and finer-grained image recognition. The bi-level routing attention module enables more flexible computation allocation and content awareness. Its dynamic attention mechanism adaptively adjusts the attention weights according to the features of the input image, so that different positions or features receive different levels of attention and the intensity and range of attention can be flexibly controlled. The model achieves a mean average precision (mAP) of 87.3 on the COCO dataset and 96.7 on the VOC2007 dataset, which improves the accuracy of multi-label image classification to a certain extent.
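For readers unfamiliar with the reported metric, mean average precision for multi-label classification can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: it computes non-interpolated average precision per class and averages over classes, whereas the abstract does not state which AP variant (for example, the 11-point interpolated form often used with VOC2007) was applied. The function names and the toy data are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: predicted confidence per image; labels: 1 if the class is
    present in the image, else 0. AP is the mean of precision@rank taken
    at the rank of every positive image.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """mAP on a 0-100 scale.

    score_matrix[i][c] is the score of class c for image i;
    label_matrix has the same shape with 0/1 entries.
    """
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        aps.append(average_precision(class_scores, class_labels))
    return 100.0 * sum(aps) / len(aps)


# Toy example: 3 images, 2 classes (values are illustrative only).
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
toy_labels = [[1, 0], [0, 1], [1, 1]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

In the toy example the first class has AP 5/6 (one negative image is ranked above a positive one) and the second has AP 1, so `toy_map` is the average of the two, scaled to 0-100.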
