
Maize drought image classification method based on improved Swin Transformer network

  • Abstract: Accurately identifying the degree of drought stress in maize is of significant theoretical and practical value for optimizing irrigation decisions and reducing yield losses. To address the core challenges in current maize drought phenotype classification, namely large scale differences among multi-source field images, complex backgrounds, and interference from illumination changes, this study proposes a Swin Transformer with multi-scale dilated fusion attention (SWT-MDFA) network for classifying maize drought images across multiple growth stages and stress levels. The network uses residual post-normalization, scaled cosine attention, and log-spaced continuous position bias to strengthen stable modeling of maize drought phenotypes, mitigating the discriminative uncertainty caused by phenotypic differences across growth stages and by feature overlap between adjacent stress levels; a module combining multi-scale dilated convolutions with channel and spatial attention improves multi-scale feature extraction for key phenotypes such as leaf curling, chlorosis, yellowing, and drying. Results show that SWT-MDFA achieves an accuracy, precision, and recall of 97.4%, 97.4%, and 97.3% on the multi-source maize drought classification dataset, exceeding Swin Transformer by 1.5, 2.0, and 1.1 percentage points, respectively; compared with the typical backbone networks Res2Net-50, Xception, SE-ResNet, and DenseNet-121, its accuracy, precision, and recall improve by 2.4–6.2, 2.3–7.0, and 1.7–6.9 percentage points. Classification visualizations show that SWT-MDFA attends stably and clearly to the key leaf phenotype regions of maize under drought stress. Overall, the proposed method can effectively meet the recognition needs of maize drought phenotypes under complex field conditions and provides basic technical support for subsequent intelligent maize drought monitoring.

     

    Abstract: Accurate classification of maize drought phenotypes from in-field visible-light imagery is essential for irrigation decision support and yield-loss mitigation, yet the task remains challenging under practical farmland conditions where scale variation, background clutter, and illumination fluctuations jointly degrade feature stability and inter-class separability. To address these difficulties, a multi-source maize drought image dataset was established and an improved Swin Transformer–based framework was developed for nine-class phenotype classification across multiple growth stages and drought severities. The original collection contained 948 images acquired from three channels, including fixed high-definition cameras deployed at agricultural research stations, in-field smartphone photography, and publicly accessible online repositories. After strict manual screening, 132 low-quality or redundant images (13.9% of the originals) were removed, leaving 816 high-quality samples. To enhance robustness to acquisition variability, data augmentation operations—random rotations, salt-and-pepper noise injection, and adaptive histogram equalization—were applied, expanding the dataset to 3 628 images. Each image was annotated into one of nine categories defined by the intersection of three drought severity levels (no drought, mild drought, severe drought) and three growth stages (jointing, tasseling, maturity); the final dataset was split into training, validation, and test subsets in a 7:1:2 ratio (2 542/363/723). As the backbone, Swin Transformer, a hierarchical Vision Transformer variant with shifted-window self-attention, was adopted to balance contextual modeling and computational efficiency for high-resolution field images. 
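One of the augmentation operations listed above, salt-and-pepper noise injection, can be sketched as follows. This is a minimal NumPy illustration under assumed parameters, not the paper's implementation; the function name and the `amount` value are illustrative only.

```python
import numpy as np

def salt_and_pepper(img: np.ndarray, amount: float = 0.02, rng=None) -> np.ndarray:
    """Inject salt-and-pepper noise: flip a random fraction `amount` of
    pixels to either 255 (salt) or 0 (pepper), leaving the rest intact.
    Works on grayscale (H, W) or color (H, W, C) uint8 images."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = img.copy()
    mask = rng.random(img.shape[:2]) < amount   # which pixels get noise
    salt = rng.random(img.shape[:2]) < 0.5      # 50/50 split salt vs. pepper
    noisy[mask & salt] = 255
    noisy[mask & ~salt] = 0
    return noisy
```

Note that the reported subset sizes are consistent with the totals in the abstract: 948 − 132 = 816 originals retained, and 2 542 + 363 + 723 = 3 628 augmented images in the 7:1:2 split.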
Building upon the baseline, three complementary modifications were incorporated at the Transformer block level: 1) residual post-normalization, placing layer normalization after residual connections to stabilize feature scaling in deep layers; 2) scaled cosine attention, replacing dot-product similarity with cosine similarity and a learnable scaling factor to reduce sensitivity to activation magnitude and improve similarity measurement under varying illumination; and 3) log-spaced continuous position bias (Log-CPB), generating continuous relative position biases via logarithmic coordinate mapping to facilitate transfer across different window sizes and image resolutions. In addition, a multi-scale dilated fusion attention (MDFA) module was designed to strengthen drought-relevant representation by integrating four parallel dilated convolution branches (dilation rates 1, 6, 12, and 18) and fusing channel attention and spatial attention to emphasize informative channels and locations associated with leaf curling, chlorosis/yellowing, and drying symptoms. The resulting model was termed Swin Transformer with multi-scale dilated fusion attention (SWT-MDFA). Ablation experiments quantified the contribution of each component on the test set: the baseline Swin Transformer achieved 95.9% accuracy, 95.4% precision, and 96.2% recall; residual post-normalization increased accuracy to 96.1%; scaled cosine attention improved accuracy to 96.2% and recall to 96.6%; Log-CPB mainly benefited recall to 96.8% under complex backgrounds; and MDFA produced the largest single-module gain, reaching 97.0% accuracy with 96.3% precision and 97.2% recall. With all components integrated, SWT-MDFA achieved the best overall performance, attaining 97.4% accuracy, 97.4% precision, and 97.3% recall, corresponding to improvements of 1.5, 2.0, and 1.1 percentage points over the baseline, respectively. 
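The scaled cosine attention and the Log-CPB coordinate mapping described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: `tau` is a fixed scalar here standing in for the learnable per-head scale, and the log-spaced mapping is shown only as a coordinate transform, without the small network that maps coordinates to biases.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def log_spaced(rel):
    """Log-spaced relative-coordinate mapping in the spirit of Log-CPB:
    sign(x) * log2(1 + |x|), compressing large relative offsets so biases
    extrapolate more gracefully across window sizes."""
    return np.sign(rel) * np.log2(1.0 + np.abs(rel))

def scaled_cosine_attention(q, k, v, tau=0.07, bias=None):
    """Attention using cosine similarity divided by a scale `tau` instead of
    the dot product. q, k, v: (n_tokens, d); bias: optional
    (n_tokens, n_tokens) relative position bias added before softmax."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    sim = qn @ kn.T / tau   # bounded similarity, insensitive to activation magnitude
    if bias is not None:
        sim = sim + bias
    return softmax(sim, axis=-1) @ v
```

Because the similarity is computed on unit-normalized vectors, rescaling the queries or keys by any positive constant leaves the attention weights unchanged, which is the property the abstract invokes for robustness to illumination-driven magnitude shifts.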
Comparative evaluation against the representative convolutional backbones SE-ResNet, DenseNet-121, Xception, and Res2Net-50 confirmed the superiority of SWT-MDFA, whose accuracy exceeded that of Res2Net-50 (95.0%), Xception (93.6%), SE-ResNet (92.4%), and DenseNet-121 (91.2%). In terms of computational cost, SWT-MDFA required 88.9 million parameters and 48.9 giga floating-point operations (GFLOPs), indicating that the accuracy gains were achieved within a manageable computational budget. Confusion-matrix analysis on the 723-image test set (704 correctly classified, 97.4% overall accuracy) showed that misclassifications were concentrated mainly between adjacent drought severities or neighboring growth stages, whereas classes with more distinctive phenotypes were recognized without error. Gradient-weighted class activation mapping further indicated that the model consistently focused on drought-relevant leaf regions while suppressing irrelevant background responses, supporting interpretable and reliable phenotype classification across growth stages and severity levels.
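The reported metrics can be related to the confusion matrix as follows. This is a generic sketch of accuracy plus macro-averaged precision and recall; the paper's exact averaging scheme is an assumption.

```python
import numpy as np

def metrics_from_confusion(cm: np.ndarray):
    """Compute overall accuracy and macro-averaged precision/recall from a
    confusion matrix with rows = true class and columns = predicted class."""
    cm = cm.astype(float)
    accuracy = np.trace(cm) / cm.sum()                     # correct / total
    col_sums = np.clip(cm.sum(axis=0), 1e-12, None)        # predicted per class
    row_sums = np.clip(cm.sum(axis=1), 1e-12, None)        # actual per class
    precision = np.mean(np.diag(cm) / col_sums)            # macro precision
    recall = np.mean(np.diag(cm) / row_sums)               # macro recall
    return accuracy, precision, recall
```

As a consistency check on the abstract's figures, 704 correct out of 723 test images rounds to 97.4% overall accuracy.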

     
