Abstract:
Orange (Citrus sinensis) is one of the most important fruit crops owing to its outstanding economic value, and its large-scale cultivation has become an important vehicle for rural revitalization in hilly areas. Manual harvesting cannot fully meet the demands of large-scale production, due to high labor costs, low efficiency, and a tendency to damage fruit. In particular, the shortage of agricultural labor has become increasingly acute with urbanization and an aging population, making the mechanization and intelligent transformation of harvesting operations inevitable for the development of the industry. This study aimed to enhance the efficiency and accuracy of real-time orange detection in unstructured orchard environments under the constraints of embedded edge computing platforms. A lightweight object detection model, named YOLOv8n-Light, was proposed in alignment with the Roofline performance model, in order to raise computational intensity relative to memory access; a systematic optimization was also made to balance resource consumption against detection accuracy. The backbone of the baseline YOLOv8n network was replaced with the lightweight ShuffleNetV2 architecture, which uses channel splitting, pointwise convolution, and depthwise separable convolution to minimize parameter size and computational cost while still extracting fine-grained features. Furthermore, a novel lightweight detection head was introduced on top of this backbone, sharing 3×3 convolutional kernels across feature pyramid levels; redundant parameter storage and activation memory traffic were thereby significantly reduced, yielding a streamlined and more efficient pipeline. The concatenation module was restructured to incorporate the SE (Squeeze-and-Excitation) attention mechanism, which recalibrates channel-wise responses according to feature importance.
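The parameter savings behind the ShuffleNetV2-style backbone come from the standard decomposition of a k×k convolution into a depthwise and a pointwise stage. The sketch below compares parameter counts for the two forms; the channel width of 116 is an illustrative value, not a figure taken from the paper.

```python
def conv_params(k, c_in, c_out):
    # Standard k x k convolution: every output channel mixes all input channels.
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # Depthwise k x k convolution (one filter per input channel),
    # followed by a 1 x 1 pointwise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

std = conv_params(3, 116, 116)          # 121,104 parameters
sep = dw_separable_params(3, 116, 116)  # 14,500 parameters
print(std, sep, round(std / sep, 1))    # roughly an 8x reduction
```

For large channel counts the ratio approaches k² (here 9×), which is what makes depthwise separable convolutions attractive on memory-bound embedded platforms.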
The SE module enhanced the network's sensitivity to relevant object features under complex conditions, such as varying illumination, background clutter, and partial occlusion. The loss function was redesigned to integrate MPDIoU (Minimum Point Distance Intersection over Union) and Focaler-IoU (Focalized Intersection over Union), in order to improve localization. This hybrid loss imposed stronger penalties on inaccurate bounding boxes and dynamically balanced the precision-recall trade-off according to the quality of each prediction, resulting in high regression accuracy and robust detection sensitivity. A series of experiments was conducted on a Raspberry Pi 4B platform with 8 GB of RAM. The YOLOv8n-Light model reached an inference speed of 2.8 FPS (frames per second), a 64.7% increase over the original YOLOv8n. The model attained a precision of 96.5%, 2.2 percentage points higher than the original YOLOv8n, a recall of 89.5%, and a mAP (mean Average Precision) of 97.0%. Field evaluations were carried out in an orchard using a six-degree-of-freedom robotic arm equipped with an Intel RealSense depth camera. The average positioning errors were 2.48 mm along the X-axis, 3.13 mm along the Y-axis, and 4.13 mm along the Z-axis. The robotic fruit-picking system achieved a recognition accuracy of 97.59%, a localization accuracy of 96.39%, and an overall picking success rate of 93.98%, confirming the applicability of the system under real-world agricultural conditions. In conclusion, the YOLOv8n-Light model effectively balanced computational efficiency and detection accuracy on resource-constrained embedded platforms by combining architectural improvements, attention mechanisms, and an optimized loss function.
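The two loss components can be sketched numerically. The MPDIoU term below follows the commonly published formulation (IoU minus the squared top-left and bottom-right corner distances, normalized by the image diagonal), and the Focaler-IoU term is a linear remapping of IoU onto a quality band [d, u]; the threshold values and the boxes are illustrative assumptions, since the abstract does not give the authors' exact settings or weighting.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns plain Intersection over Union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def mpd_iou(pred, gt, img_w, img_h):
    # MPDIoU penalizes the squared distances between matching top-left
    # and bottom-right corners, normalized by the squared image diagonal.
    d1 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    d2 = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2
    norm = img_w ** 2 + img_h ** 2
    return iou(pred, gt) - d1 / norm - d2 / norm

def focaler_iou(v, d=0.0, u=0.95):
    # Focaler-IoU linearly remaps an IoU value so training focuses on a
    # chosen quality band; d and u here are illustrative thresholds.
    return min(1.0, max(0.0, (v - d) / (u - d)))

# A slightly shifted prediction is penalized beyond its plain IoU:
print(iou((2, 2, 12, 12), (0, 0, 10, 10)))            # ~0.47
print(mpd_iou((2, 2, 12, 12), (0, 0, 10, 10), 100, 100))
```

Using 1 minus these quantities as the regression loss gives the behavior described above: corner-distance penalties sharpen localization, while the Focaler remapping shifts gradient emphasis between easy and hard samples.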
Reliable performance was achieved in both controlled and real-world orchard environments. This lightweight refinement of citrus fruit detection can serve as a strong reference for automated harvesting equipment.