Abstract:
Rapid and accurate detection of pear fruit growth quality, combined with real-time picking-point localization in complex orchard environments, is a key enabler for intelligent agricultural harvesting robots. To address low recognition accuracy under occlusion and variable illumination, as well as insufficient inference speed on embedded devices, this study proposes an improved pear fruit detection and localization method based on YOLOv8n with three targeted enhancements. First, the C2f modules in the backbone network are replaced with FasterNet blocks built on partial convolution (PConv), significantly reducing computational redundancy and improving memory-access efficiency. Second, a global attention mechanism is introduced after the spatial pyramid pooling fast (SPPF) layer to strengthen the extraction of critical features, helping the model focus on small targets and suppress background interference. Third, the original CIoU loss function is replaced with an Inner-CIoU loss whose scale ratio was set to 0.8 after systematic experimentation, yielding better gradient behavior, faster convergence, and higher localization precision for small and overlapping pear fruits. A dedicated pear fruit dataset was constructed from images captured by an Intel RealSense D455i binocular camera in natural orchards, covering multiple varieties and challenging conditions including disease, fruit overlap, and occlusion; data augmentation expanded the dataset to 3,000 images. Experimental results show that the proposed YOLOv8n-Pear model achieves a precision of 96.80%, a recall of 93.40%, and a mean average precision of 96.70%, improvements of 4.0, 3.2, and 4.0 percentage points, respectively, over the baseline YOLOv8n. The model also reduces floating-point operations by 30.23% and shrinks the model size from 7.1 MB to 4.2 MB, cutting its memory footprint by roughly 41%.
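The partial-convolution idea behind the FasterNet block can be illustrated with a minimal sketch: only the first C/n_div channels pass through the convolution, while the remaining channels are copied through unchanged, which is where the savings in FLOPs and memory access come from. The naive per-pixel loop and the single-sample (C, H, W) layout below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def partial_conv(x, weight, n_div=4):
    """PConv sketch (FasterNet idea): a 3x3 convolution is applied to only
    the first C // n_div channels; the remaining channels pass through
    untouched.  x: (C, H, W); weight: (Cp, Cp, 3, 3) with Cp = C // n_div.
    Same-padding is used so the spatial size is preserved."""
    c, h, w = x.shape
    cp = c // n_div
    out = x.copy()                                  # untouched channels kept
    pad = np.pad(x[:cp], ((0, 0), (1, 1), (1, 1)))  # zero-pad H and W by 1
    conv = np.zeros((cp, h, w))
    for o in range(cp):                             # naive conv, clarity only
        for i in range(h):
            for j in range(w):
                conv[o, i, j] = np.sum(weight[o] * pad[:, i:i + 3, j:j + 3])
    out[:cp] = conv
    return out
```

With n_div = 4, only a quarter of the channels are convolved, so the 3x3-conv FLOPs drop to roughly 1/16 of a full convolution over the same tensor, at the cost of leaving most channels for later pointwise mixing.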
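Likewise, the Inner-CIoU loss can be sketched as a standard CIoU in which the IoU term is computed on auxiliary "inner" boxes whose widths and heights are shrunk by the scale ratio (0.8 here). The (cx, cy, w, h) box format and the exact arrangement of terms below follow the common Inner-IoU formulation and are assumptions, not the paper's code.

```python
import math

def inner_iou(box1, box2, ratio=0.8):
    """IoU computed on auxiliary boxes scaled by `ratio` about each
    box centre.  Boxes are (cx, cy, w, h)."""
    def corners(b):
        cx, cy, w, h = b
        return (cx - w * ratio / 2, cy - h * ratio / 2,
                cx + w * ratio / 2, cy + h * ratio / 2)
    x11, y11, x12, y12 = corners(box1)
    x21, y21, x22, y22 = corners(box2)
    iw = max(0.0, min(x12, x22) - max(x11, x21))
    ih = max(0.0, min(y12, y22) - max(y11, y21))
    inter = iw * ih
    union = (box1[2] * ratio) * (box1[3] * ratio) \
          + (box2[2] * ratio) * (box2[3] * ratio) - inter
    return inter / union if union > 0 else 0.0

def inner_ciou_loss(box1, box2, ratio=0.8, eps=1e-7):
    """CIoU loss with the IoU term replaced by Inner-IoU.  The centre
    distance, enclosing box, and aspect-ratio terms use the original
    (unscaled) boxes, as in standard CIoU."""
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    b1 = (cx1 - w1 / 2, cy1 - h1 / 2, cx1 + w1 / 2, cy1 + h1 / 2)
    b2 = (cx2 - w2 / 2, cy2 - h2 / 2, cx2 + w2 / 2, cy2 + h2 / 2)
    cw = max(b1[2], b2[2]) - min(b1[0], b2[0])    # enclosing-box width
    ch = max(b1[3], b2[3]) - min(b1[1], b2[1])    # enclosing-box height
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2    # squared centre distance
    c2 = cw ** 2 + ch ** 2 + eps                  # squared enclosing diagonal
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    iou = inner_iou(box1, box2, ratio)
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

A ratio below 1 shrinks the auxiliary boxes, which sharpens the IoU gradient for high-overlap pairs; ratios above 1 instead help low-overlap pairs, which is why the choice of 0.8 warranted the systematic experimentation mentioned above.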
On the embedded Jetson Orin NX platform, the model achieves an average inference speed of 180.3 frames per second at a power consumption of only 19 W, demonstrating its suitability for real-time deployment on power-constrained systems. For 3D localization, the binocular camera was calibrated using Zhang's method, and a coordinate-transformation pipeline was established to convert the 2D pixel coordinates of detected healthy pear fruits into 3D world coordinates. Field tests show maximum positioning errors in the X, Y, and Z directions of 12 mm, 12 mm, and 10 mm, respectively, with average errors of 6.6 mm, 7.1 mm, and 7.1 mm, all within acceptable limits for robotic harvesting. The recognition and localization system was deployed on a self-developed pear-picking actuator and field-tested outdoors on a picking platform, achieving a fruit-picking success rate of 90.2% and an average cycle time of approximately 5 s per fruit during continuous picking. This study addresses key challenges in visual perception for fruit-harvesting robots, is suitable for deployment on resource-constrained devices, and provides recognition and localization support for harvesting robots targeting pears and similar fruits.
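The pixel-to-3D step of the localization pipeline can be sketched with the standard pinhole back-projection followed by an extrinsic transform. The intrinsic values (fx, fy, cx, cy) and the rotation/translation below are illustrative placeholders, not the calibrated parameters obtained from Zhang's method in this work.

```python
import numpy as np

def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_to_world(p_cam, R, t):
    """Apply the extrinsic calibration: p_world = R @ p_cam + t."""
    return R @ p_cam + t

# Placeholder intrinsics for illustration only (not the paper's calibration).
fx = fy = 600.0
cx, cy = 320.0, 240.0
p_cam = pixel_to_camera_xyz(500.0, 300.0, 0.85, fx, fy, cx, cy)
p_world = camera_to_world(p_cam, np.eye(3), np.zeros(3))
```

With a RealSense device, librealsense's `rs2_deproject_pixel_to_point` performs the equivalent back-projection (including lens-distortion handling) from the factory-calibrated intrinsics.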