Abstract:
Red Sichuan pepper (Zanthoxylum bungeanum) is one of the most important economic crops in hilly and mountain orchards, and its harvesting requires bruise-free grasping. However, challenges remain due to dense canopy structures, irregular illumination, and occlusion by leaves and branches. Conventional vision systems can hardly maintain both high detection precision and real-time performance on embedded hardware, particularly for densely distributed small clusters that are easily confused with specular leaves and branch textures. In this study, a lightweight framework was developed for visual perception and spatial localization, enabling precise manipulation of clustered targets during mechanical harvesting in unstructured field environments. A dataset of 999 field images with 5,094 annotated instances was constructed to represent typical orchard scenes under strong viewpoint and illumination variability. Based on the YOLOv11n baseline, a customized lightweight detector (named PepNet v1) was designed, incorporating multi-scale convolutional attention in the detection head and hetero-scale attention in the backbone. These components were tailored to enhance multi-scale feature aggregation for dense clusters and to suppress background interference with minimal computational overhead. The model was trained with cosine-decayed learning rate scheduling, label smoothing, and L2 regularization, together with moderate color and geometric augmentation. Additionally, a nearest multi-point depth estimation method was introduced to cope with occlusion, empty-space interference, and inter-berry gaps, in contrast to conventional single-pixel depth estimation.
The depth was computed from the minimum-range set of surrounding valid pixels rather than from a single center pixel, providing a bounded error trend suitable for parameterized correction. Experiments were conducted under both laboratory and field conditions using an Intel RealSense D435i depth camera and an industrial computer running ONNX Runtime exclusively on a central processing unit. PepNet v1 achieved a mean average precision (mAP@0.5) of 0.817 on the test set while maintaining a computational cost of only 6.5 GFLOPs, comparable to the YOLOv11n baseline, and an average inference speed of 45.52 frames per second on 640×640 pixel inputs, thereby balancing detection confidence with actuation safety. Ablation experiments demonstrated that the multi-scale convolutional attention improved fine-grained cluster representation, particularly for small targets at long distances, while the hetero-scale attention reduced false activations in complex backgrounds. Both mechanisms increased precision without compromising inference latency, and PepNet v1 achieved the most balanced performance among lightweight models. In the spatial localization task, the conventional center-pixel approach exhibited severe instability when depth information was missing or contaminated by background elements, with errors exceeding 80 cm in extreme cases. In contrast, the nearest multi-point approach consistently underestimated the true depth by approximately 1.52 cm but produced significantly lower variance under both static and motion-disturbed conditions. A constant geometric correction was therefore derived from the average cluster size, reducing the final localization error to 0.41 and 0.35 cm under static testing and dynamic perturbation, respectively, indicating millimeter-level accuracy suitable for collision-free grasp planning.
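The nearest multi-point estimator with its constant correction can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name, the choice of k, and the numeric correction value are assumptions.

```python
import numpy as np

def nearest_multipoint_depth(depth_cm, box, k=10, correction_cm=1.5):
    """Depth of a detected cluster from the k nearest valid pixels in its box.

    depth_cm: HxW depth map in centimeters (0 marks missing depth);
    box: (x1, y1, x2, y2) detection in pixel coordinates;
    k and correction_cm are illustrative values, not the reported ones.
    """
    x1, y1, x2, y2 = box
    patch = depth_cm[y1:y2, x1:x2]
    valid = patch[patch > 0]                   # discard holes and empty-space returns
    if valid.size == 0:
        return None                            # no usable depth in this box
    k = min(k, valid.size)
    nearest = np.partition(valid, k - 1)[:k]   # k smallest ranges (cluster front)
    # Averaging the nearest ranges biases the estimate toward the cluster's
    # front surface, so a constant geometric offset derived from the average
    # cluster size is added back.
    return float(nearest.mean()) + correction_cm
```

Because the nearest-range average tracks the front of the cluster, its bias is roughly constant (about 1.52 cm in the reported experiments), which is what makes a single additive correction effective.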
Field experiments further validated the system-level benefits: pose-aligned harvesting with the refined depth estimation increased the grasp success rate to 75% at the cluster level and reduced the cycle time by approximately 13 s, an improvement of more than 60% over the unaligned control. In conclusion, a unified lightweight perception framework was presented that integrates efficient object detection and stable depth estimation for clustered crops. ONNX-exported models and CPU-only inference matched cost-sensitive field platforms, and the model was deployable on industrial PCs without GPU dependence in mountain orchards. The pipeline outputs both 2D detections and camera-frame 3D coordinates at each frame. Ablation tests confirmed that the attention-enhanced head improved recall on partially occluded clusters, whereas the backbone scale attention reduced leaf-induced false positives, with a negligible change in parameter count and throughput. Branch elasticity, real-time closed-loop control, and damage-aware gripping remain open challenges in orchard environments; tactile sensing, adaptive force control, and multi-modal fusion can be incorporated to enhance harvesting reliability under dynamic disturbances. Overall, the framework provides a practical foundation for scalable robotic harvesting of clustered crops in unstructured agricultural environments.
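As a minimal sketch of the CPU-only deployment path, the snippet below prepares a frame for a 640×640 ONNX detector. The nearest-neighbor resize and normalization are simplifying assumptions (a real pipeline would typically letterbox to preserve aspect ratio), and the model filename is hypothetical.

```python
import numpy as np

def preprocess(frame_bgr, size=640):
    """Resize a BGR frame to a (1, 3, size, size) float32 tensor in [0, 1]."""
    h, w = frame_bgr.shape[:2]
    # Plain nearest-neighbor resize, kept dependency-free for illustration.
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    resized = frame_bgr[ys][:, xs]
    x = resized[..., ::-1].astype(np.float32) / 255.0        # BGR -> RGB, scale
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])  # HWC -> NCHW

# CPU-only inference with ONNX Runtime (no GPU provider requested);
# "pepnet_v1.onnx" is a placeholder path:
#   import onnxruntime as ort
#   session = ort.InferenceSession("pepnet_v1.onnx",
#                                  providers=["CPUExecutionProvider"])
#   outputs = session.run(None, {session.get_inputs()[0].name: preprocess(frame)})
```

Restricting the providers list to `CPUExecutionProvider` is what keeps the exported model runnable on GPU-free industrial PCs.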