Abstract:
Harvesting robots have become key equipment for intensive production in modern agriculture. Visual perception enables a harvesting robot to recognize, evaluate, and locate picking targets, and is a basic prerequisite for selective harvesting. However, existing fruit recognition and positioning approaches are limited by low accuracy and efficiency, resulting in low harvesting success rates and high damage rates under the constraints of unstructured facility environments and tomato planting modes. In this study, a tomato picking recognition and positioning system was proposed using an improved YOLOv8s model and RGB-D information fusion. RGB and depth images of the tomato picking area were captured with an Intel RealSense D435 depth camera, and the labeled images were used to construct the datasets. A spatial reconstructed convolution unit (SRCU) and a channel reconstructed convolution unit (CRCU) were designed and combined into a convolution reconstruction unit, SCRConv, to modify the neck of YOLOv8s. Through low-cost operations and feature reuse, the resulting lightweight model better learned the representative features of tomatoes at different maturity levels, enabling recognition in complex field environments. The parameter-free attention mechanism SimAM, which computes 3D attention weights inspired by attention mechanisms in the human brain, was introduced into the neck to highlight the key features of tomatoes in cluttered scenes. Finally, the CIoU loss function was replaced with the MPDIoU loss to reduce missed detections caused by bounding-box distortion under fruit overlap. The improved YOLOv8s model was then combined with aligned RGB and depth information to obtain tomato targets and their spatial locations for the harvesting robot.
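The abstract itself contains no code; as an illustration of the attention mechanism named above, the following PyTorch sketch implements SimAM in its standard published form (Yang et al., 2021). The module name and the default regularizer e_lambda follow the original SimAM paper and are not details confirmed by this study.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free 3D attention (Yang et al., 2021).

    Each activation's weight is derived from an energy function
    measuring how distinctive it is relative to the other activations
    in its channel; no learnable parameters are added.
    """
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularizer in the energy function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # squared deviation of each activation from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel variance
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse energy: lower energy -> more distinctive -> larger weight
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)
```

Likewise, a minimal sketch of the MPDIoU loss that replaces CIoU, following the published MPDIoU formulation: the squared distances between corresponding top-left and bottom-right corners of the predicted and ground-truth boxes are subtracted from the IoU, normalized by the squared image diagonal. The (x1, y1, x2, y2) box layout is an assumption for illustration.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                img_w: int, img_h: int, eps: float = 1e-7) -> torch.Tensor:
    """MPDIoU loss for boxes given as (x1, y1, x2, y2)."""
    # plain IoU
    iw = (torch.min(pred[..., 2], target[..., 2]) -
          torch.max(pred[..., 0], target[..., 0])).clamp(min=0)
    ih = (torch.min(pred[..., 3], target[..., 3]) -
          torch.max(pred[..., 1], target[..., 1])).clamp(min=0)
    inter = iw * ih
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared corner distances (top-left and bottom-right)
    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    diag2 = img_w ** 2 + img_h ** 2  # squared image diagonal
    return 1.0 - (iou - d1 / diag2 - d2 / diag2)
```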
Experimental results showed that the precision (P) and recall (R) of the improved YOLOv8s model increased by 4.03 and 4.45 percentage points, respectively, over the original model. Its mAP50 increased from 91.49% to 95.81%, the model size decreased from 22.5 MB to 17.6 MB, and the inference time decreased from 10.6 ms to 8.7 ms. The improved model outperformed the YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, and YOLOv10 series in recognition accuracy, speed, and computational efficiency. The RGB and depth images were aligned and fused to obtain the 3D coordinates of each tomato's center point. A visual perception and decision-making system was built with the RealSense D435 RGB-D camera and an NVIDIA Jetson AGX Orin edge device. The positioning error for tomatoes was less than 4 mm within a working range of 1.0 m, fully meeting the accuracy requirements of picking. Picking and grasping experiments guided by the visual perception system were carried out in the laboratory: the average processing time per frame was less than 50 ms, the overall success rate was 94.73%, and the damage rate was only 4.17%. The improved model is therefore suitable for tomato recognition and localization on performance-limited hardware, and the findings provide strong technical support for the visual detection systems of fruit harvesting robots.
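For concreteness, the following Python sketch shows one conventional way to perform the RGB-D alignment and positioning step described above, using Intel's pyrealsense2 SDK for the D435 named in the abstract. The stream resolutions and the center pixel (u, v) are illustrative assumptions; the paper's actual pipeline parameters are not given in the abstract.

```python
import pyrealsense2 as rs

# Start synchronized color and depth streams (resolutions are assumptions).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Align depth pixels to the color image so detector boxes index depth directly.
align = rs.align(rs.stream.color)
frames = align.process(pipeline.wait_for_frames())
depth_frame = frames.get_depth_frame()

# (u, v): center pixel of a detected tomato box -- hypothetical values here;
# in practice they would come from the improved YOLOv8s detections.
u, v = 320, 240
depth_m = depth_frame.get_distance(u, v)  # depth at that pixel, in meters

# Deproject the pixel to a 3D point (x, y, z) in the camera frame.
intrin = depth_frame.profile.as_video_stream_profile().get_intrinsics()
x, y, z = rs.rs2_deproject_pixel_to_point(intrin, [u, v], depth_m)
print(f"tomato center at ({x:.3f}, {y:.3f}, {z:.3f}) m")

pipeline.stop()
```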