Abstract:
Harvesting robots have become key equipment for intensive production in modern agriculture. Visual perception enables a harvesting robot to recognize, evaluate, and locate picking targets, and is a basic prerequisite for selective harvesting. However, existing fruit recognition and positioning approaches are limited by low accuracy and efficiency, resulting in low harvesting success rates and high damage rates under the constraints of unstructured facility environments and tomato planting modes. In this study, a tomato picking recognition and positioning system was proposed using an improved YOLOv8s model and RGB-D information fusion. RGB and depth images of the tomato picking area were captured with an Intel RealSense D435 depth camera, and the labeled images were used to construct the datasets. A spatial reconstructed convolution unit (SRCU) and a channel reconstructed convolution unit (CRCU) were designed and combined into a convolution reconstruction unit, SCRConv, to modify the neck of YOLOv8s. Through low-cost operations and feature reuse, the resulting lightweight model better learned the representative features of tomatoes at different maturity levels, enabling recognition in complex field environments. The parameter-free attention mechanism SimAM, which computes 3D attention weights inspired by attention mechanisms in the human brain, was introduced into the neck to highlight the key features of tomatoes in cluttered scenes. Finally, the CIoU loss function was replaced with the MPDIoU loss to reduce missed detections caused by bounding-box distortion under fruit overlap. The improved YOLOv8s model was then combined with aligned RGB and depth information to obtain tomato targets and their spatial locations for the harvesting robot.
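The abstract itself contains no code; as an illustration of the attention mechanism named above, the following PyTorch sketch implements SimAM in its standard published form (Yang et al., 2021). The module name and the default regularizer e_lambda follow the original SimAM paper and are not details confirmed by this study.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free 3D attention (Yang et al., 2021).

    Each activation's weight is derived from an energy function
    measuring how distinctive it is relative to the other activations
    in its channel; no learnable parameters are added.
    """
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularizer in the energy function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # squared deviation of each activation from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel variance
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse energy: lower energy -> more distinctive -> larger weight
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)
```

Likewise, a minimal sketch of the MPDIoU loss that replaces CIoU, following the published MPDIoU formulation: the squared distances between corresponding top-left and bottom-right corners of the predicted and ground-truth boxes are subtracted from the IoU, normalized by the squared image diagonal. The (x1, y1, x2, y2) box layout is an assumption for illustration.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                img_w: int, img_h: int, eps: float = 1e-7) -> torch.Tensor:
    """MPDIoU loss for boxes given as (x1, y1, x2, y2)."""
    # plain IoU
    iw = (torch.min(pred[..., 2], target[..., 2]) -
          torch.max(pred[..., 0], target[..., 0])).clamp(min=0)
    ih = (torch.min(pred[..., 3], target[..., 3]) -
          torch.max(pred[..., 1], target[..., 1])).clamp(min=0)
    inter = iw * ih
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared corner distances (top-left and bottom-right)
    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    diag2 = img_w ** 2 + img_h ** 2  # squared image diagonal
    return 1.0 - (iou - d1 / diag2 - d2 / diag2)
```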
Experimental results showed that the precision (P) and recall (R) of the improved YOLOv8s model increased by 4.03 and 4.45 percentage points, respectively, over the original model. Its mAP50 increased from 91.49% to 95.81%, the model size decreased from 22.5 MB to 17.6 MB, and the inference time decreased from 10.6 ms to 8.7 ms. The improved model outperformed the YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, and YOLOv10 series in recognition accuracy, speed, and computational efficiency. The RGB and depth images were aligned and fused to obtain the 3D coordinates of each tomato's center point. A visual perception and decision-making system was built with the RealSense D435 RGB-D camera and an NVIDIA Jetson AGX Orin edge device. The positioning error for tomatoes was less than 4 mm within a working range of 1.0 m, fully meeting the accuracy requirements of picking. Picking and grasping experiments guided by the visual perception system were carried out in the laboratory: the average processing time per frame was less than 50 ms, the overall success rate was 94.73%, and the damage rate was only 4.17%. The improved model is therefore suitable for tomato recognition and localization on performance-limited hardware, and the findings provide strong technical support for the visual detection systems of fruit harvesting robots.
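For concreteness, the following Python sketch shows one conventional way to perform the RGB-D alignment and positioning step described above, using Intel's pyrealsense2 SDK for the D435 named in the abstract. The stream resolutions and the center pixel (u, v) are illustrative assumptions; the paper's actual pipeline parameters are not given in the abstract.

```python
import pyrealsense2 as rs

# Start synchronized color and depth streams (resolutions are assumptions).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Align depth pixels to the color image so detector boxes index depth directly.
align = rs.align(rs.stream.color)
frames = align.process(pipeline.wait_for_frames())
depth_frame = frames.get_depth_frame()

# (u, v): center pixel of a detected tomato box -- hypothetical values here;
# in practice they would come from the improved YOLOv8s detections.
u, v = 320, 240
depth_m = depth_frame.get_distance(u, v)  # depth at that pixel, in meters

# Deproject the pixel to a 3D point (x, y, z) in the camera frame.
intrin = depth_frame.profile.as_video_stream_profile().get_intrinsics()
x, y, z = rs.rs2_deproject_pixel_to_point(intrin, [u, v], depth_m)
print(f"tomato center at ({x:.3f}, {y:.3f}, {z:.3f}) m")

pipeline.stop()
```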