Abstract:
Harvesters need to recognize mud-adhered water chestnuts with high accuracy during the conveying stage. However, recognition remains challenging under complex conditions such as soil adhesion and occlusion by soil clods. In this study, an improved water chestnut recognition model based on RT-DETR-L (Real-Time Detection Transformer Large) was proposed. Firstly, a Wavelet Transform Hybrid Group Block (WTHGBlock) was adopted in the backbone network to strengthen the discriminative power of the target features. Secondly, a Multi-scale Head Dynamic Interaction Attention Feature Integration module (MSHD-AIFI) was constructed to improve multi-scale performance, particularly the recognition of small targets. Its three-level "local, medium, and global" branches captured features of different sizes, with the weights adjusted dynamically. Finally, a Biqi Morphology-aware IoU Loss (BM-IoU Loss) was introduced, which incorporates morphological constraints, size weighting, and region optimization to reduce the localization deviations caused by morphological similarity and occlusion. Each module was verified on the water chestnut recognition task under cluttered soil conditions. Ablation experiments were conducted on the three modules, with RT-DETR-L as the baseline model. With the WTHGBlock module alone, the precision, recall, and mAP0.5 reached 85.0%, 91.1%, and 94.3%, respectively, which were 1.2, 0.7, and 1.0 percentage points higher than the baseline. With the MSHD-AIFI module alone, the recall and mAP0.5 reached 91.3% and 93.6%, respectively, which were 0.9 and 0.3 percentage points higher than the baseline, while the precision decreased by 0.7 percentage points. The multi-scale dynamic attention enhanced the perception of small targets and occluded areas, but background noise also led to a slight increase in false detections.
With the BM-IoU Loss alone, the precision, recall, and mAP0.5 were 84.4%, 91.7%, and 93.4%, respectively, which were 0.6, 1.3, and 0.1 percentage points higher than the baseline RT-DETR-L model. When all three modules were integrated into the baseline, the resulting RT-DETR-L-WMB model achieved the best performance, with precision, recall, and mAP0.5 improved by 1.4, 1.8, and 1.6 percentage points, respectively. Among the three modules, the WTHGBlock enhanced feature discriminability, the MSHD-AIFI module improved multi-scale coverage, and the BM-IoU Loss improved localization accuracy. Together they improved recognition performance in complex scenes. The RT-DETR-L-WMB model achieved a recognition precision, recall, and mAP0.5 of 85.2%, 92.2%, and 94.9%, respectively. Compared with the mainstream models of the YOLO series, these metrics were higher by 2.4 to 3.9, 0.9 to 1.7, and 3.4 to 4.1 percentage points, respectively. In terms of efficiency, the parameter count of the improved model was 10.9 to 23.8 M lower and the detection speed was 1.0 to 6.7 frames/s higher than those of the mainstream YOLO-series models. The improved model therefore delivered higher detection speed with fewer parameters. Heat-map visualization showed that the model's attention was concentrated more tightly on the actual target regions. The experiments showed that the RT-DETR-L-WMB model produced few missed or false detections in complex scenarios with densely stacked water chestnuts or backgrounds mixed with soil clods. The parameter count and recognition speed of the improved model were 34.7 M and 41.9 frames/s, respectively, meeting the real-time and lightweight requirements of water chestnut picking scenarios. These findings can provide technical support for picking and sorting tuber crops such as water chestnut.
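
The ablation figures above can be cross-checked with simple arithmetic. The baseline RT-DETR-L scores are not stated in the abstract, so the following minimal Python sketch only derives the values implied by each reported metric and its reported change, and confirms that all four variants point to the same baseline. The dictionary and function names are hypothetical and chosen for illustration.

```python
# Metrics reported in the abstract, in percent.
# MSHD-AIFI precision is not reported directly, only its change.
REPORTED = {
    "WTHGBlock": {"precision": 85.0, "recall": 91.1, "mAP0.5": 94.3},
    "MSHD-AIFI": {"recall": 91.3, "mAP0.5": 93.6},
    "BM-IoU Loss": {"precision": 84.4, "recall": 91.7, "mAP0.5": 93.4},
    "RT-DETR-L-WMB": {"precision": 85.2, "recall": 92.2, "mAP0.5": 94.9},
}

# Reported changes relative to the baseline, in percentage points.
GAINS = {
    "WTHGBlock": {"precision": 1.2, "recall": 0.7, "mAP0.5": 1.0},
    "MSHD-AIFI": {"precision": -0.7, "recall": 0.9, "mAP0.5": 0.3},
    "BM-IoU Loss": {"precision": 0.6, "recall": 1.3, "mAP0.5": 0.1},
    "RT-DETR-L-WMB": {"precision": 1.4, "recall": 1.8, "mAP0.5": 1.6},
}


def implied_baseline(variant):
    """Baseline score implied by each reported metric of one variant."""
    return {
        metric: round(value - GAINS[variant][metric], 1)
        for metric, value in REPORTED[variant].items()
    }


def consistent_baseline():
    """Merge the implied baselines; raise if two variants disagree."""
    baseline = {}
    for variant in REPORTED:
        for metric, value in implied_baseline(variant).items():
            if baseline.setdefault(metric, value) != value:
                raise ValueError(f"{variant} disagrees on {metric}")
    return baseline


if __name__ == "__main__":
    base = consistent_baseline()
    print(base)  # {'precision': 83.8, 'recall': 90.4, 'mAP0.5': 93.3}
    # Precision implied for MSHD-AIFI alone (derived, not stated in the abstract).
    print(round(base["precision"] + GAINS["MSHD-AIFI"]["precision"], 1))  # 83.1
```

Under these figures, every variant implies the same baseline of 83.8% precision, 90.4% recall, and 93.3% mAP0.5, so the reported improvements are mutually consistent.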