Abstract:
Harvesting robots in smart agriculture require rapid and accurate detection of pear fruit growth quality, together with real-time picking-point localization, in complex orchard environments. In this study, an improved YOLOv8n model was proposed for pear fruit detection and localization, targeting high accuracy under occlusion and variable illumination while maintaining high inference speed on embedded devices. Three enhancements were incorporated: (1) The C2f modules in the backbone network were replaced with FasterNet Blocks built on partial convolution, significantly reducing computational redundancy and improving memory access efficiency. (2) A global attention mechanism was introduced after the spatial pyramid pooling fast (SPPF) layer to extract critical features, focusing more effectively on small targets while suppressing background interference. (3) The original CIoU loss function was replaced with Inner-CIoU, with a scale factor of 0.8 selected through systematic experimentation; this accelerated convergence and enhanced the gradient quality and localization precision for small and overlapping pear fruits. An image dataset was also constructed to verify the improved model. Pear fruit images were captured with an Intel RealSense D455i binocular camera in natural orchards, covering multiple varieties and challenging conditions such as disease, fruit overlap, and occlusion; data augmentation expanded the dataset to 3,000 images. Experimental results demonstrate that the improved YOLOv8n-Pear model achieved a precision of 96.80%, a recall of 93.40%, and a mean average precision of 96.70%, improvements of 4.0, 3.2, and 4.0 percentage points over the baseline YOLOv8n, respectively. Moreover, floating-point operations were reduced by 30.23% and the memory footprint by 48.15%, from 7.1 to 4.2 MB.
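The Inner-IoU idea behind enhancement (3) can be sketched as follows. This is a minimal illustration, assuming the standard Inner-IoU formulation in which both the predicted and ground-truth boxes are shrunk by the scale factor (here 0.8) around their centers before the IoU is computed; the function and variable names are illustrative, not taken from the paper's implementation:

```python
def inner_iou(box_p, box_gt, ratio=0.8):
    """Compute IoU on auxiliary boxes scaled by `ratio` around each
    box's center (ratio < 1 shrinks the boxes, sharpening the loss
    gradient for high-overlap samples). Boxes are (cx, cy, w, h)."""
    def scaled_corners(box):
        cx, cy, w, h = box
        return (cx - w * ratio / 2, cy - h * ratio / 2,
                cx + w * ratio / 2, cy + h * ratio / 2)

    l1, t1, r1, b1 = scaled_corners(box_p)
    l2, t2, r2, b2 = scaled_corners(box_gt)

    # Intersection of the two auxiliary (inner) boxes
    iw = max(0.0, min(r1, r2) - max(l1, l2))
    ih = max(0.0, min(b1, b2) - max(t1, t2))
    inter = iw * ih

    # Union of the auxiliary boxes
    union = (box_p[2] * box_p[3] + box_gt[2] * box_gt[3]) * ratio ** 2 - inter
    return inter / union if union > 0 else 0.0
```

In Inner-CIoU, this inner IoU replaces the plain IoU term of the CIoU loss, while the center-distance and aspect-ratio penalty terms are kept unchanged.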
On the embedded Jetson Orin NX platform, the model achieved an average inference speed of 180.3 frames per second at a power consumption of only 19 W, demonstrating real-time deployment on power-constrained systems. The binocular camera was calibrated for 3D localization, and a coordinate transformation was established to convert the 2D pixel coordinates of healthy pear fruits into 3D world coordinates. Field tests show that the maximum positioning errors in the X, Y, and Z directions were 12, 12, and 10 mm, respectively, with average errors of 6.6, 7.1, and 7.1 mm, all within acceptable limits for robotic harvesting. Finally, the vision system was integrated with a four-degree-of-freedom harvesting actuator on outdoor Y-trellis pear trees. The system achieved a harvesting success rate of approximately 90.2% and an average continuous picking time of about 5 s per fruit over ten experimental groups totaling 100 picking attempts, fully meeting the practical requirements of robotic harvesting. The improved YOLOv8n model effectively balanced high accuracy with low computational cost. These findings provide a robust solution for visual perception in fruit-harvesting robots, particularly on resource-constrained embedded platforms.
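The 2D-to-3D coordinate transformation described above can be sketched with the standard pinhole back-projection followed by an extrinsic camera-to-world transform. The intrinsic and extrinsic values below are placeholders for illustration only, not the calibrated parameters of the D455i:

```python
import numpy as np

def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with depth `depth_m` (metres)
    into 3D camera-frame coordinates using pinhole intrinsics
    (focal lengths fx, fy and principal point cx, cy in pixels)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_to_world(p_cam, R, t):
    """Map camera-frame coordinates into the robot/world frame
    using the calibrated extrinsics (rotation R, translation t)."""
    return R @ p_cam + t

# Placeholder calibration values, for illustration only
fx = fy = 600.0
cx, cy = 320.0, 240.0
p_cam = pixel_to_camera_xyz(400, 300, 0.8, fx, fy, cx, cy)
p_world = camera_to_world(p_cam, np.eye(3), np.zeros(3))
```

In practice the depth at the detected picking point would come from the stereo camera's aligned depth stream, and R and t from a hand-eye calibration between the camera and the harvesting actuator.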