Abstract:
Accurate and robust instance segmentation of tomato fruit clusters is a prerequisite for automated harvesting in modern greenhouses. However, the inherent complexity of protected cultivation environments, including dense fruit distribution, severe occlusion by vines, high intra-class similarity, small-scale fruit stems, and dynamic lighting, poses substantial challenges to visual perception systems. Existing deep learning-based segmentation models often suffer from incomplete mask boundaries, missed detections under heavy occlusion, and insufficient real-time performance, particularly on the lightweight platforms required by agricultural robots. To overcome these limitations, this study developed an improved YOLOv11-based instance segmentation framework, termed YOLOv11_DLQM, tailored for greenhouse tomato harvesting, with enhancements to feature extraction, contextual modeling, the detection head, and bounding box regression. The proposed model incorporates a Diverse Branch Block (DBB) in the backbone to diversify receptive fields and strengthen fine-grained feature representation without increasing inference cost. To address occlusion and multi-scale variation, the standard SPPF module was upgraded with a Large Separable Kernel Attention (LSKA) mechanism, enabling the network to capture long-range contextual cues, suppress background interference, and sharpen the discriminability of fruit-stem boundaries. For the detection head, a Quality-aware Lightweight Shared Convolutional Detector (QLSCD) was adopted, combining Lightweight Shared Convolution (LSCD) with Localization Quality Estimation (LQE) to share parameters across scales and improve the consistency between classification confidence and localization quality.
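The "without increasing inference cost" property of DBB rests on structural re-parameterization: because convolution is linear, parallel training-time branches can be algebraically merged into a single convolution for inference. The following toy 1-D sketch illustrates only that fusion principle (it is not the DBB or YOLOv11_DLQM implementation, and the function names are ours):

```python
# Toy 1-D illustration of the re-parameterization idea behind the
# Diverse Branch Block (DBB): parallel branches (here a 3-tap kernel and
# a 1-tap kernel) are fused into one 3-tap kernel after training, so the
# deployed network pays for a single convolution. Hedged sketch only.

def conv1d(x, k):
    """'Same'-padded 1-D convolution (correlation) with an odd-length kernel."""
    pad = len(k) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k[j] * xp[i + j] for j in range(len(k))) for i in range(len(x))]

def fuse_branches(k3, k1):
    """Embed the 1-tap branch at the centre of the 3-tap kernel and sum."""
    fused = list(k3)
    fused[1] += k1[0]
    return fused

x  = [1.0, 2.0, -1.0, 0.5]
k3 = [0.2, 0.5, -0.1]   # analogue of a 3x3 branch
k1 = [0.7]              # analogue of a 1x1 branch

multi_branch = [a + b for a, b in zip(conv1d(x, k3), conv1d(x, k1))]
single_conv  = conv1d(x, fuse_branches(k3, k1))
# The fused single convolution reproduces the two-branch output
# (up to floating-point rounding).
print(all(abs(a - b) < 1e-9 for a, b in zip(multi_branch, single_conv)))
```

The same linearity argument extends to 2-D kernels, batch-norm folding, and average-pooling branches, which is how DBB enriches training-time features while keeping the inference graph unchanged.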
Furthermore, the Minimum Point Distance IoU (MPDIoU) was employed as the regression loss to accelerate bounding box convergence, mitigate jitter during training, and improve mask prediction quality in dense fruit clusters. Extensive experiments on a large-scale greenhouse tomato dataset demonstrated that YOLOv11_DLQM achieved substantial improvements over the baseline YOLOv11n and other state-of-the-art instance segmentation models. For box detection, the precision, recall, and mAP50 of YOLOv11_DLQM reached 87.1%, 88.9%, and 92.8%, respectively, improvements of 2.3%, 7.1%, and 7.7% over the baseline YOLOv11n. For mask detection, the precision, recall, and mAP50 reached 86.2%, 88.0%, and 92.5%, respectively, increases of 1.0%, 5.5%, and 5.0% over the baseline. Despite the added modules, the model maintained a low parameter count of 2.84 M. On an RTX 4090 platform, single-frame inference took approximately 2.2 ms, well within millisecond-level real-time requirements. Ablation studies confirmed the effectiveness of each module and their synergistic contribution to overall performance. Introducing the DBB module raised box and mask detection accuracy by 3.2% and 3.7%, respectively, while the parameter count and floating-point operations remained unchanged. LSKA improved detection under both slight and heavy vine occlusion: under slight occlusion, mask accuracy rose from 81.3% to 84.3% and mAP50 from 82.7% to 85.5% relative to the baseline YOLOv11n; under heavy occlusion, mask accuracy rose from 78.4% to 80.9%, and although recall dipped from 80.1% to 78.6%, overall mAP50 increased from 80.2% to 82.4%. QLSCD improved box mAP50 by 1.0%, reduced GFLOPs to 9.1, shortened inference time to 1.2 ms, and decreased the parameter count from 2.84 M to 2.56 M. With MPDIoU, recall in the box detection task reached 88.9%, and misclassification among the three categories "vine", "stem", and "bunch" was reduced. Comparative evaluations against ConvNeXt, Mask R-CNN, SOLOv2, YOLACT, RTMSeg, YOLOv8n, and YOLOv11n showed that YOLOv11_DLQM provides the best balance of accuracy, robustness, and computational efficiency. Visualization results further demonstrated that the proposed model maintains high-quality segmentation under strong illumination, weak illumination, multi-color clusters, highly overlapping fruits, and both slight and heavy vine occlusion. To evaluate cross-variety generalization, additional images of French Saopolo tomatoes grown in the Beijing Cuihu facility and 192-variety tomatoes collected in Suzhou were analyzed.
Despite the differences in fruit morphology, cluster compactness, and reflectance properties, YOLOv11_DLQM consistently produced accurate and complete masks without edge distortion or false segmentation, highlighting its robustness across diverse production environments. In conclusion, the YOLOv11_DLQM model offers a lightweight, high-accuracy, and highly robust solution for real-world greenhouse tomato perception. The superior performance under complex environmental conditions and across different tomato varieties indicates strong application potential in automated harvesting systems.
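The MPDIoU regression loss mentioned above augments the standard IoU with normalized squared distances between the matching top-left and bottom-right corners of the predicted and ground-truth boxes, which is what drives the faster convergence and reduced jitter. A minimal pure-Python sketch, assuming boxes in (x1, y1, x2, y2) pixel format (the function names and box convention are our assumptions, not the paper's code):

```python
# Minimal sketch of the MPDIoU bounding-box loss (L = 1 - MPDIoU),
# assuming (x1, y1, x2, y2) boxes; not the authors' implementation.

def mpdiou(pred, gt, img_w, img_h):
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Standard IoU of the two boxes.
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((px2 - px1) * (py2 - py1)
             + (gx2 - gx1) * (gy2 - gy1) - inter)
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners, normalized by the
    # squared image diagonal so the penalty is scale-invariant.
    d1 = (px1 - gx1) ** 2 + (py1 - gy1) ** 2   # top-left corners
    d2 = (px2 - gx2) ** 2 + (py2 - gy2) ** 2   # bottom-right corners
    diag2 = img_w ** 2 + img_h ** 2
    return iou - d1 / diag2 - d2 / diag2

def mpdiou_loss(pred, gt, img_w, img_h):
    return 1.0 - mpdiou(pred, gt, img_w, img_h)

# A perfect prediction gives loss 0; a shifted box is penalized both for
# lower overlap and for corner displacement.
print(mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 640))  # 0.0
print(mpdiou_loss((20, 20, 60, 60), (10, 10, 50, 50), 640, 640))
```

Because the two corner distances are zero only when the boxes coincide exactly, the gradient stays informative even for overlapping boxes with equal IoU, which plausibly explains the reduced box jitter reported in dense fruit clusters.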