Abstract:
This research addresses the core technical challenges that currently limit robotic vision systems in automated asparagus harvesting. Asparagus spears have a naturally slender morphology, and when they grow densely in the field the tender stems are highly prone to mutual occlusion and overlap; the stout "mother stems" present alongside them create further complex background interference. Together, these factors degrade the accuracy of vision-based multi-target segmentation and recognition, which in turn severely compromises the precise positioning and harvesting capability of the robotic end-effector.

To overcome these obstacles, this study adopts the lightweight instance segmentation model YOLO11n-seg as a baseline and proposes an optimized model, YOLO11n-SAL, tailored to slender, occluded targets. Its core architectural advances are two newly introduced modules that strengthen feature extraction and attention.

First, the Multi-scale Edge Enhancement Module (MEEM) is designed and integrated to counter the tendency of the inherently weak edge features of slender asparagus targets to be lost during convolution. By decomposing convolutional feature maps at multiple scales, MEEM extracts and intensifies edge and contour information separately before fusing the features. This significantly raises the model's sensitivity to target boundaries, improves segmentation precision, and enhances its perception of targets with slender morphological structure.

Second, the Separated and Enhancement Attention Module (SEAM) is introduced to rectify the feature confusion and incomplete information caused by inter-target occlusion. Through attention separation across the channel and spatial dimensions, SEAM enables the model to adaptively perceive the local and global features of occluded asparagus at varying scales. By selectively enhancing and fusing these features, the model focuses on the visible parts of partially masked targets while suppressing background noise and distractors, so that detection and recognition remain robust in complex, cluttered scenes.
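The abstract describes the two modules only at the level of design intent, so the following is a minimal PyTorch sketch of those ideas rather than the authors' implementation: MEEM as multi-scale extraction and fusion of high-frequency edge responses, and SEAM as attention computed separately over the channel and spatial dimensions and then fused. Every layer choice, channel count, and default below is an illustrative assumption.

```python
# Illustrative sketch only; the paper does not publish this code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MEEM(nn.Module):
    """Multi-scale Edge Enhancement Module (sketch).

    Decomposes a feature map at several scales, isolates edge-like
    high-frequency content at each scale (feature minus its local
    average), then fuses the enhanced edges back with the input.
    """
    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.edge_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
             for _ in scales]
        )
        self.fuse = nn.Conv2d(channels * (len(scales) + 1), channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [x]
        for s, conv in zip(self.scales, self.edge_convs):
            xs = F.avg_pool2d(x, kernel_size=s) if s > 1 else x
            # High-frequency residual approximates an edge/contour response.
            edge = conv(xs - F.avg_pool2d(xs, 3, stride=1, padding=1))
            if s > 1:
                edge = F.interpolate(edge, size=(h, w), mode="bilinear",
                                     align_corners=False)
            branches.append(edge)
        return self.fuse(torch.cat(branches, dim=1))

class SEAM(nn.Module):
    """Separated and Enhancement Attention Module (sketch).

    Computes channel attention and spatial attention separately, then
    fuses the two re-weighted maps so the visible parts of occluded
    targets are emphasized and background clutter is suppressed.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_att = nn.Sequential(          # squeeze-excitation style
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(          # mean/max channel statistics
            nn.Conv2d(2, 1, 7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        xc = x * self.channel_att(x)               # channel branch
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        xs = x * self.spatial_att(stats)           # spatial branch
        return self.fuse(torch.cat([xc, xs], dim=1))

# Usage: both modules preserve feature-map shape, e.g.
# x = torch.randn(1, 64, 80, 80); y = SEAM(64)(MEEM(64)(x))
```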
To verify the effectiveness of the proposed model, systematic comparative experiments were conducted. Quantitative evaluation shows that the improved YOLO11n-SAL achieves significant gains over the baseline on all key performance indicators. In the bounding-box detection task, the model reached a precision of 94.2%, a recall of 83.1%, a mean average precision at an IoU threshold of 0.5 ($\mathrm{mAP}_{0.5}(\mathrm{Box})$) of 91.2%, and a mean average precision averaged over IoU thresholds from 0.5 to 0.95 ($\mathrm{mAP}_{0.5:0.95}(\mathrm{Box})$) of 76.2%. In the more granular instance-mask segmentation task, the corresponding precision, recall, $\mathrm{mAP}_{0.5}(\mathrm{Mask})$, and $\mathrm{mAP}_{0.5:0.95}(\mathrm{Mask})$ reached 93.4%, 77.9%, 90.7%, and 62.7%, respectively. Furthermore, heatmap analysis and a comparative assessment of detection and segmentation results show that YOLO11n-SAL perceives asparagus edge features markedly better across scenarios and offers superior multi-target segmentation and recognition under occlusion. Compared to the baseline, the proposed method yields better detection and segmentation outcomes, handles interference in complex situations effectively, and significantly improves segmentation and recognition accuracy in multi-scenario environments.

Finally, to validate recognition and positioning performance in actual deployment, asparagus recognition, positioning, and harvesting-grasp trials were carried out using a depth camera and a robotic arm. The trials achieved a positioning success rate of 90% together with effective harvesting and grasping performance, confirming that the proposed system provides reliable technical support for precise agricultural robotic harvesting operations.
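For reference on the metrics reported above: $\mathrm{mAP}_{0.5}$ scores a prediction as correct when its IoU with a ground-truth instance is at least 0.5, while $\mathrm{mAP}_{0.5:0.95}$ averages the AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, the COCO-style convention commonly used for YOLO-family evaluation. A minimal sketch of that averaging, assuming a hypothetical `ap_at_iou` helper that returns AP at a single threshold:

```python
# Sketch of COCO-style mAP(0.5:0.95) averaging. `ap_at_iou` is a
# hypothetical callable (not from the paper) returning the average
# precision of the predictions at one IoU threshold.
import numpy as np

def map_50_95(predictions, ground_truth, ap_at_iou):
    thresholds = np.arange(0.50, 1.00, 0.05)  # 0.50, 0.55, ..., 0.95
    aps = [ap_at_iou(predictions, ground_truth, t) for t in thresholds]
    return float(np.mean(aps))
```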