Water chestnut recognition in complex environments using the improved RT-DETR-L model

刘浩蓬; 张乐妍; 张国忠; 孔维臻; 宁水仙; 卫佳

doi:10.11975/j.issn.1002-6819.202509034

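Abstract

Water chestnuts on the conveying device of a harvester arrive mixed with mud, and soil adhesion and occlusion by soil clods lower recognition accuracy. To address this problem, this study proposes a water chestnut recognition method based on an improved RT-DETR-L (real-time detection transformer large). First, because the backbone network discriminates features poorly, a wavelet transform hybrid group block (WTHGBlock) is used to optimize the backbone and improve the discriminability of target features. Second, a multi-scale head dynamic interaction attention feature integration module (MSHD-AIFI) is constructed; its three-level "local, medium, global" branches capture features of different sizes and adjust weights dynamically, which strengthens small-target recognition and improves multi-scale recognition. Finally, a water chestnut morphology-aware IoU loss (biqi morphology-aware IoU Loss, BM-IoU Loss) is introduced; through morphological constraints, size weighting, and core-region optimization, it reduces the localization deviation caused by morphological similarity and occlusion and improves recognition precision. The results show that the improved model, RT-DETR-L-WMB, achieves a precision, recall, and mean average precision of 85.2%, 92.2%, and 94.9%, respectively. These values are 1.4, 1.8, and 1.6 percentage points higher than those of the baseline RT-DETR-L, and 2.4 to 3.9, 0.9 to 1.7, and 3.4 to 4.1 percentage points higher than those of mainstream YOLO-series models. In the heat maps, the model focuses more precisely on the true target regions, and in complex environments (densely stacked water chestnuts, water chestnuts mixed with soil clods) its missed-detection and false-detection rates are low. The improved model has 34.7 M parameters and a recognition speed of 41.9 frames/s, which meets the real-time and lightweight requirements of automated water chestnut picking. This study can provide technical support for picking and sorting during the harvest of water chestnut and other tuber crops.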
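The abstract names the ingredients of the BM-IoU Loss (morphological constraints, size weighting, core-region optimization) but does not give its formula. As a point of reference only, the sketch below shows the standard intersection-over-union of two axis-aligned boxes and the plain `1 - IoU` loss that IoU-based losses generally extend. The function names `box_iou` and `iou_loss` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration; nothing here reproduces the authors' loss.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate boxes with zero union are treated as non-overlapping.
    return inter / union if union > 0 else 0.0


def iou_loss(pred: Box, target: Box) -> float:
    """Plain IoU loss; the extra BM-IoU terms are NOT modeled here."""
    return 1.0 - box_iou(pred, target)


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(iou_loss((0, 0, 2, 2), (0, 0, 2, 2)))
```

Two 2×2 boxes offset by one unit overlap in a 1×1 square, so their IoU is 1/7; identical boxes give a loss of 0. A morphology-aware variant would add further terms on top of this quantity.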

Abstract: Water chestnuts in the conveying stage of a harvester are mixed with mud, and recognizing them accurately remains challenging under complex conditions such as soil adhesion and occlusion by soil clods. In this study, a water chestnut recognition method was proposed based on an improved RT-DETR-L (Real-Time Detection Transformer Large). Firstly, a Wavelet Transform Hybrid Group Block (WTHGBlock) was adopted to optimize the backbone network and strengthen the discriminative power of the target features. Secondly, a Multi-scale Head Dynamic Interaction Attention Feature Integration module (MSHD-AIFI) was constructed to improve small-target recognition and multi-scale performance: three-level "local, medium, and global" branches captured features of varied sizes, and the weights were adjusted dynamically. Finally, a Biqi Morphology-aware IoU Loss (BM-IoU Loss) was introduced, which incorporates morphological constraints, size weighting, and core-region optimization to reduce the localization deviations caused by morphological similarity and occlusion. Each module was verified on the water chestnut recognition task under cluttered soil conditions, with RT-DETR-L as the baseline model, and ablation experiments were conducted on the three modules. With the WTHGBlock module, the precision, recall, and mAP0.5 reached 85.0%, 91.1%, and 94.3%, respectively, improvements of 1.2, 0.7, and 1.0 percentage points over the baseline model. With the MSHD-AIFI module, the recall and mAP0.5 reached 91.3% and 93.6%, respectively, improvements of 0.9 and 0.3 percentage points over the baseline, while the precision decreased by 0.7 percentage points: the multi-scale dynamic attention enhanced the perception of small targets and occluded areas, but it also responded to background noise, which slightly increased the false detections. With the BM-IoU Loss, the precision, recall, and mAP0.5 were 84.4%, 91.7%, and 93.4%, respectively, improvements of 0.6, 1.3, and 0.1 percentage points over the baseline RT-DETR-L model. When all three modules were integrated into the baseline, the resulting RT-DETR-L-WMB model achieved the best performance, with a precision, recall, and mAP0.5 of 85.2%, 92.2%, and 94.9%, respectively, improvements of 1.4, 1.8, and 1.6 percentage points over the baseline model. The three modules were complementary: the WTHGBlock enhanced feature discriminability, the MSHD-AIFI module improved multi-scale coverage, and the BM-IoU Loss improved localization accuracy, and together they improved recognition in complex scenes. Compared with the mainstream models of the YOLO series, the three metrics of RT-DETR-L-WMB were higher by 2.4 to 3.9, 0.9 to 1.7, and 3.4 to 4.1 percentage points, respectively. In terms of efficiency, the parameter count of the improved model was 10.9 to 23.8 M lower and its detection speed 1.0 to 6.7 frames/s higher than those of the mainstream YOLO-series models, so the improved model ran faster while using fewer computational resources. In the heat maps, the model focused more precisely on the actual target regions. The experiments showed that the RT-DETR-L-WMB model had low missed-detection and false-detection rates in complex scenarios with densely stacked water chestnuts or with soil clods mixed among them. The parameter count and recognition speed of the improved model were 34.7 M and 41.9 frames/s, respectively, meeting the real-time and lightweight requirements of water chestnut picking scenarios. This finding can provide technical support for picking and sorting tuber crops such as water chestnut.
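The ablation figures above are stated as values plus gains in percentage points, so the baseline metrics, which the abstract never lists directly, can be recovered by subtraction. The sketch below is a minimal bookkeeping check under that reading. It uses only the numbers quoted in the abstract; the names `implied_baseline`, `is_consistent`, and `mshd_precision` are illustrative, and the baseline values it computes are derived here rather than reported by the authors.

```python
# Metrics reported for the full RT-DETR-L-WMB model (percent) and its
# reported gains over the baseline RT-DETR-L (percentage points).
full_model = {"precision": 85.2, "recall": 92.2, "mAP0.5": 94.9}
full_gain = {"precision": 1.4, "recall": 1.8, "mAP0.5": 1.6}


def implied_baseline(value: float, gain: float) -> float:
    """Baseline metric implied by a reported value and its gain."""
    return round(value - gain, 1)


# Derived (not reported) baseline metrics.
baseline = {m: implied_baseline(full_model[m], full_gain[m]) for m in full_model}

# Single-module ablation rows: metric -> (reported value, reported gain).
# MSHD-AIFI precision is omitted because only its change (-0.7) is given.
ablation = {
    "WTHGBlock": {"precision": (85.0, 1.2), "recall": (91.1, 0.7), "mAP0.5": (94.3, 1.0)},
    "MSHD-AIFI": {"recall": (91.3, 0.9), "mAP0.5": (93.6, 0.3)},
    "BM-IoU Loss": {"precision": (84.4, 0.6), "recall": (91.7, 1.3), "mAP0.5": (93.4, 0.1)},
}


def is_consistent(rows: dict, reference: dict, tol: float = 0.05) -> bool:
    """True if every ablation row implies the same baseline as `reference`."""
    return all(
        abs(implied_baseline(value, gain) - reference[metric]) <= tol
        for row in rows.values()
        for metric, (value, gain) in row.items()
    )


# Precision implied for MSHD-AIFI from the derived baseline and the
# reported 0.7-point decrease (again derived, not reported).
mshd_precision = round(baseline["precision"] - 0.7, 1)

if __name__ == "__main__":
    print(baseline)
    print(is_consistent(ablation, baseline))
    print(mshd_precision)
```

Under this reading, every ablation row implies the same baseline of 83.8% precision, 90.4% recall, and 93.3% mAP0.5, and the MSHD-AIFI precision works out to 83.1%. Note that these are differences in percentage points, not relative percentage changes.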
