Abstract:
Accurate and robust instance segmentation of tomato fruit clusters is a prerequisite for automated harvesting in modern greenhouses. However, the inherent complexity of protected cultivation environments, including dense fruit distribution, severe occlusion by vines, high intra-class similarity, small-scale fruit stems, and dynamic lighting, poses substantial challenges to visual perception systems. Existing deep learning-based segmentation models often suffer from incomplete mask boundaries, missed detections under heavy occlusion, and insufficient real-time performance, particularly on the lightweight platforms required by agricultural robots. To overcome these limitations, this study developed an improved YOLOv11-based instance segmentation framework, termed YOLOv11_DLQM, tailored for greenhouse tomato harvesting, with enhancements to feature extraction, contextual modeling, the detection head, and bounding box regression. The proposed model incorporates a Diverse Branch Block (DBB) in the backbone to diversify receptive fields and strengthen fine-grained feature representation without increasing inference cost. To address occlusion and multi-scale variation, the standard SPPF module was upgraded with a Large Separable Kernel Attention (LSKA) mechanism, enabling the network to capture long-range contextual cues, suppress background interference, and sharpen the discriminability of fruit-stem boundaries. For the detection head, a Quality-aware Lightweight Shared Convolutional Detector (QLSCD) was adopted, combining Lightweight Shared Convolution (LSCD) with Localization Quality Estimation (LQE) to share parameters across scales and improve the consistency between classification confidence and localization quality.
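The "without increasing inference cost" property of DBB rests on structural re-parameterization: because convolution is linear, parallel training-time branches can be algebraically merged into a single convolution for inference. The following toy 1-D sketch illustrates only that fusion principle (it is not the DBB or YOLOv11_DLQM implementation, and the function names are ours):

```python
# Toy 1-D illustration of the re-parameterization idea behind the
# Diverse Branch Block (DBB): parallel branches (here a 3-tap kernel and
# a 1-tap kernel) are fused into one 3-tap kernel after training, so the
# deployed network pays for a single convolution. Hedged sketch only.

def conv1d(x, k):
    """'Same'-padded 1-D convolution (correlation) with an odd-length kernel."""
    pad = len(k) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k[j] * xp[i + j] for j in range(len(k))) for i in range(len(x))]

def fuse_branches(k3, k1):
    """Embed the 1-tap branch at the centre of the 3-tap kernel and sum."""
    fused = list(k3)
    fused[1] += k1[0]
    return fused

x  = [1.0, 2.0, -1.0, 0.5]
k3 = [0.2, 0.5, -0.1]   # analogue of a 3x3 branch
k1 = [0.7]              # analogue of a 1x1 branch

multi_branch = [a + b for a, b in zip(conv1d(x, k3), conv1d(x, k1))]
single_conv  = conv1d(x, fuse_branches(k3, k1))
# The fused single convolution reproduces the two-branch output
# (up to floating-point rounding).
print(all(abs(a - b) < 1e-9 for a, b in zip(multi_branch, single_conv)))
```

The same linearity argument extends to 2-D kernels, batch-norm folding, and average-pooling branches, which is how DBB enriches training-time features while keeping the inference graph unchanged.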
Furthermore, the Minimum Point Distance IoU (MPDIoU) was employed as the regression loss to accelerate bounding box convergence, mitigate jitter during training, and improve mask prediction quality in dense fruit clusters. Extensive experiments on a large-scale greenhouse tomato dataset demonstrated that YOLOv11_DLQM achieved substantial improvements over the baseline YOLOv11n and other state-of-the-art instance segmentation models. For box detection, the precision, recall, and mAP50 of YOLOv11_DLQM reached 87.1%, 88.9%, and 92.8%, respectively, improvements of 2.3%, 7.1%, and 7.7% over the baseline YOLOv11n. For mask detection, the precision, recall, and mAP50 reached 86.2%, 88.0%, and 92.5%, respectively, increases of 1.0%, 5.5%, and 5.0% over the baseline. Despite the added modules, the model maintained a low parameter count of 2.84 M. On an RTX 4090 platform, single-frame inference took approximately 2.2 ms, well within millisecond-level real-time requirements. Ablation studies confirmed the effectiveness of each module and their synergistic contribution to overall performance. Introducing the DBB module raised box and mask detection accuracy by 3.2% and 3.7%, respectively, while the parameter count and floating-point operations remained unchanged. LSKA improved detection under both slight and heavy vine occlusion: under slight occlusion, mask accuracy rose from 81.3% to 84.3% and mAP50 from 82.7% to 85.5% relative to the baseline YOLOv11n; under heavy occlusion, mask accuracy rose from 78.4% to 80.9%, and although recall dipped from 80.1% to 78.6%, overall mAP50 increased from 80.2% to 82.4%. QLSCD improved box mAP50 by 1.0%, reduced GFLOPs to 9.1, shortened inference time to 1.2 ms, and decreased the parameter count from 2.84 M to 2.56 M. With MPDIoU, recall in the box detection task reached 88.9%, and misclassification among the three categories "vine", "stem", and "bunch" was reduced. Comparative evaluations against ConvNeXt, Mask R-CNN, SOLOv2, YOLACT, RTMSeg, YOLOv8n, and YOLOv11n showed that YOLOv11_DLQM provides the best balance of accuracy, robustness, and computational efficiency. Visualization results further demonstrated that the proposed model maintains high-quality segmentation under strong illumination, weak illumination, multi-color clusters, highly overlapping fruits, and both slight and heavy vine occlusion. To evaluate cross-variety generalization, additional images of French Saopolo tomatoes grown in the Beijing Cuihu facility and 192-variety tomatoes collected in Suzhou were analyzed.
Despite the differences in fruit morphology, cluster compactness, and reflectance properties, YOLOv11_DLQM consistently produced accurate and complete masks without edge distortion or false segmentation, highlighting its robustness across diverse production environments. In conclusion, the YOLOv11_DLQM model offers a lightweight, high-accuracy, and highly robust solution for real-world greenhouse tomato perception. The superior performance under complex environmental conditions and across different tomato varieties indicates strong application potential in automated harvesting systems.
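The MPDIoU regression loss mentioned above augments the standard IoU with normalized squared distances between the matching top-left and bottom-right corners of the predicted and ground-truth boxes, which is what drives the faster convergence and reduced jitter. A minimal pure-Python sketch, assuming boxes in (x1, y1, x2, y2) pixel format (the function names and box convention are our assumptions, not the paper's code):

```python
# Minimal sketch of the MPDIoU bounding-box loss (L = 1 - MPDIoU),
# assuming (x1, y1, x2, y2) boxes; not the authors' implementation.

def mpdiou(pred, gt, img_w, img_h):
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Standard IoU of the two boxes.
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((px2 - px1) * (py2 - py1)
             + (gx2 - gx1) * (gy2 - gy1) - inter)
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners, normalized by the
    # squared image diagonal so the penalty is scale-invariant.
    d1 = (px1 - gx1) ** 2 + (py1 - gy1) ** 2   # top-left corners
    d2 = (px2 - gx2) ** 2 + (py2 - gy2) ** 2   # bottom-right corners
    diag2 = img_w ** 2 + img_h ** 2
    return iou - d1 / diag2 - d2 / diag2

def mpdiou_loss(pred, gt, img_w, img_h):
    return 1.0 - mpdiou(pred, gt, img_w, img_h)

# A perfect prediction gives loss 0; a shifted box is penalized both for
# lower overlap and for corner displacement.
print(mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 640))  # 0.0
print(mpdiou_loss((20, 20, 60, 60), (10, 10, 50, 50), 640, 640))
```

Because the two corner distances are zero only when the boxes coincide exactly, the gradient stays informative even for overlapping boxes with equal IoU, which plausibly explains the reduced box jitter reported in dense fruit clusters.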