Segmentation and localization method for eggplants and stems based on YOLO-CRC

  • Abstract: The complexity and uncertainty of the eggplant growing environment make it difficult for harvesting robots to locate picking points accurately. To address this problem, this study took the YOLOv8-Seg model as its framework and proposed YOLO-CRC, an instance segmentation model based on a reparameterized structure. First, a Diverse Branch Block (DBB) and a polarized attention mechanism were introduced into the Backbone to form the CDCP module, effectively improving the model's ability to segment small targets. Then, the C2f module in the Neck was optimized with reparameterization, using RepViT Blocks to form the C2f_RVB module. Finally, the CARAFE upsampling operator replaced the original feature upsampling operation to improve the model's recovery ability during upsampling. Ablation experiments showed that the improved YOLO-CRC model achieved a mean average precision (mAP0.5) of 94.1%, segmentation precisions of 95.9% and 92.2% for eggplants and eggplant stems, respectively, and an mAP0.5-0.95 of 70.6%. Grad-CAM++ heatmap analysis showed that the improved model focused its attention more closely on the eggplants and their stems. Based on the improved segmentation model, three-dimensional localization of picking points was realized with a depth camera, and a deployment experiment of the improved model was carried out. This study provides a technical reference for research on visual recognition and localization for eggplant harvesting robots.

     

    Abstract: Eggplants grow in a complex natural environment, which makes precise harvesting difficult. In this study, an improved instance segmentation model, YOLO-CRC, was proposed based on a reparameterized structure within the YOLOv8-Seg framework. A dataset of eggplants and their stems was constructed, covering variations in illumination, orientation, and occlusion, and multimodal data augmentation was employed to expose the model to more diverse data during training: various transformations were applied to the original images to generate new training samples and expand the dataset. Firstly, a Diverse Branch Block (DBB) and a polarized attention mechanism were introduced into the Backbone to form the CDCP module, which significantly improved the segmentation of small targets. The DBB extracts features at multiple scales through parallel branches that use convolution kernels of different sizes and different operation types, and the branch outputs are aggregated by summation, integrating information from multiple layers (see the reparameterization sketch after the abstract). This parallel structure enables effective fusion of features at different scales during inference, so that detection remains stable under varying backgrounds and scene complexities. Secondly, in the Neck, reparameterization was applied to optimize the C2f module, introducing RepViT Blocks to form the C2f_RVB module. The RepViT Block enhances representational power and captures richer feature details through multi-scale convolution and a channel-mixing mechanism. It also employs depthwise separable convolutions, which decompose standard convolution to significantly reduce computational complexity, together with an adaptive channel attention mechanism that weights the feature channels to strengthen the focus on critical features. Finally, the CARAFE operator replaced the original upsampling operation. CARAFE generates its upsampling kernels through a content-aware mechanism and balances the semantic information of the low-resolution feature maps, reconstructing spatial information more accurately while maintaining computational efficiency, enhancing feature reuse, and improving the fusion of local and contextual information. Ablation experiments demonstrated that the improved YOLO-CRC model achieved a mean average precision (mAP0.5) of 94.1%, segmentation precisions of 95.9% and 92.2% for eggplants and eggplant stems, respectively, and an mAP0.5-0.95 of 70.6%. An additional dataset, not used in training, was constructed to test generalization, and the YOLO-CRC model generalized well and processed the new data reliably. Although its overall processing speed was slightly lower than that of the other models, its frame rate of 44 frames per second (FPS) fully meets the requirements of most application scenarios. Grad-CAM++ heatmap analysis revealed that the improved model focused its attention more effectively on the eggplants and their stems. On this basis, two-dimensional localization of the eggplant stem picking points was achieved with the enhanced segmentation model.
Furthermore, three-dimensional localization of the picking points was realized by integrating a depth camera (see the back-projection sketch after the abstract). Ten three-dimensional positioning experiments were conducted with the depth camera at five selected orientations. The results show an average positioning error of 2.13 mm, taken as the mean of the total errors of the individual measurements, a maximum error of 2.68 mm, and an average relative error of 1.18%. These findings provide a valuable technical reference for visual recognition and localization in eggplant harvesting robots and can help advance automated harvesting technology for eggplant crops.
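
As a concrete illustration of the branch-fusion idea behind DBB mentioned above, the following is a minimal PyTorch sketch, not the authors' implementation: the block name, channel sizes, and two-branch layout are assumptions. It trains with a parallel 3x3 and 1x1 convolution whose outputs are summed, then folds both branches into one equivalent 3x3 convolution for inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchBlock(nn.Module):
    """Minimal DBB-style block (illustrative only): two parallel branches,
    a 3x3 conv and a 1x1 conv, whose outputs are summed during training."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv3 = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)

    def reparameterize(self):
        """Fold both branches into one equivalent 3x3 conv for inference.
        Convolution is linear, so summed branches sum their kernels; the
        1x1 kernel is zero-padded to 3x3 to align with the center tap."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                          kernel_size=3, padding=1)
        k1 = F.pad(self.conv1.weight.data, [1, 1, 1, 1])  # 1x1 -> 3x3
        fused.weight.data = self.conv3.weight.data + k1
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused

# The fused conv reproduces the two-branch output up to float error.
block = TwoBranchBlock(8, 16).eval()
x = torch.randn(1, 8, 32, 32)
with torch.no_grad():
    assert torch.allclose(block(x), block.reparameterize()(x), atol=1e-5)
```

The same linearity argument extends to the richer branch types in the full DBB (batch-norm chains, average pooling, sequential 1x1-3x3 convolutions), which is what allows the training-time multi-branch structure to collapse into a single convolution at inference.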
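The 3D localization step can be pictured with a standard pinhole back-projection. The sketch below is an assumption-laden illustration: the intrinsics, pixel coordinates, depth reading, and reference point are placeholder values rather than the paper's calibration, and the convention of normalizing the relative error by the reference point's distance is also an assumption.

```python
import numpy as np

def deproject(u, v, depth_mm, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a depth reading (mm) into camera
    coordinates via the pinhole model: X = (u - cx) * Z / fx, and so on."""
    z = depth_mm
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Placeholder intrinsics and measurements (not the paper's calibration).
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0
p_est = deproject(350, 260, 500.0, fx, fy, cx, cy)  # picking-point pixel + depth
p_ref = np.array([24.0, 16.5, 500.8])               # hypothetical ground truth

# One trial's positioning error (Euclidean distance, mm) and relative error;
# errors like these would be averaged over repeated trials and orientations.
err = np.linalg.norm(p_est - p_ref)
rel = err / np.linalg.norm(p_ref)
print(f"positioning error = {err:.2f} mm, relative error = {rel:.2%}")
```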

     
