Abstract:
Harvesting red Sichuan pepper (Zanthoxylum bungeanum) clusters in hilly and mountainous orchards remains challenging due to dense canopy structures, irregular illumination, occlusion by leaves and branches, and strict requirements for bruise-free grasping. Traditional vision systems struggle to maintain high detection precision and real-time performance simultaneously on resource-limited hardware, because the clusters are small, densely distributed, and easily confused with specular leaves and branch textures. This study aims to develop a lightweight visual perception and spatial localization framework that supports precise manipulation of clustered targets by an autonomous harvesting system operating in unstructured orchard environments.

A field-collected dataset containing 999 images and 5,094 annotated instances was constructed to represent typical orchard scenes with strong viewpoint and illumination variability. Based on the YOLO11n baseline, a customized lightweight detector named PepNet v1 was designed by incorporating multi-scale convolutional attention in the detection head and hetero-scale attention in the backbone. These components were tailored to enhance multi-scale feature aggregation for dense clusters and to suppress background interference at minimal computational overhead. The model was trained with cosine-decayed learning-rate scheduling, label smoothing, and L2 regularization, together with moderate color and geometric augmentation. Additionally, to address the failure of conventional single-pixel depth estimation under occlusion, empty-space interference, and inter-cluster hollows, a nearest multi-point statistical depth estimation strategy was introduced. Rather than relying on a single center pixel, the method computes depth from the minimum-range set of surrounding valid pixels, yielding a bounded error trend suitable for parameterized correction.

Experiments were conducted under both laboratory and unstructured orchard conditions using an Intel RealSense D435i depth camera and an industrial computer running ONNX Runtime exclusively on a central processing unit. The proposed PepNet v1 achieved a mean average precision of 0.817 at an intersection-over-union threshold of 0.5 on the test set, outperforming the YOLO11n baseline while requiring only 6.5 giga floating-point operations and sustaining an average inference speed of 45.5 frames per second on 640×640-pixel input video, thereby balancing detection confidence with actuation safety. Ablation experiments demonstrated that the multi-scale convolutional attention improved fine-grained cluster representation, particularly for small targets at long range, while the hetero-scale attention reduced false activations in complex backgrounds. Combined, the two mechanisms increased precision without increasing inference latency, producing the most balanced performance among the lightweight models compared. In the spatial localization task, the center-pixel method exhibited severe instability when depth information was missing or contaminated by background elements, with errors exceeding 80 centimeters in extreme cases. In contrast, the nearest multi-point method consistently underestimated the true depth by approximately 1.52 centimeters but produced significantly lower variance under both static and motion-disturbed conditions.
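To make the detection-head attention concrete, the sketch below shows a minimal PyTorch multi-scale convolutional attention block in the style of MSCA-type modules; the class name, the two-branch layout, and the kernel sizes (5, 7, 11) are illustrative assumptions, not the published PepNet v1 configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConvAttention(nn.Module):
    """Illustrative multi-scale convolutional attention: depthwise
    convolutions at several scales produce a spatial weighting that is
    multiplied back onto the input features."""

    def __init__(self, channels: int):
        super().__init__()
        # Local context via a depthwise 5x5 convolution.
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Larger receptive fields via cheap depthwise strip convolutions.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            )
            for k in (7, 11)  # assumed branch kernel sizes
        )
        self.mix = nn.Conv2d(channels, channels, 1)  # fuse scales into weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.local(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        return self.mix(attn) * x  # reweight the input feature map
```

The depthwise and strip convolutions keep the added parameter count and FLOPs small, which is consistent with the lightweight design goal stated above.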
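Likewise, the nearest multi-point depth estimation strategy can be sketched as follows, assuming a NumPy depth map in meters with zeros marking invalid returns; the window size, the count k of retained pixels, and the function name are hypothetical choices rather than the paper's calibrated parameters.

```python
import numpy as np

def nearest_multipoint_depth(depth_map, box, window=15, k=25):
    """Estimate cluster depth from the k nearest valid pixels in a window
    around the detection center, instead of trusting one center pixel.

    depth_map : HxW array of ranges in meters, with 0 marking invalid pixels.
    box       : (x1, y1, x2, y2) detection box in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    h, w = depth_map.shape
    # Clip the sampling window to the image bounds.
    patch = depth_map[max(0, cy - window): min(h, cy + window + 1),
                      max(0, cx - window): min(w, cx + window + 1)]
    valid = patch[patch > 0]              # discard missing-depth pixels
    if valid.size == 0:
        return None                       # no usable depth in the window
    # Keep the smallest ranges: the cluster's front surface, which rejects
    # background leaves and branches visible through gaps in the cluster.
    nearest = np.sort(valid)[: min(k, valid.size)]
    return float(nearest.mean())
```

Averaging only the smallest ranges biases the estimate toward the cluster's front surface, which is consistent with the small, stable underestimation reported above and is what makes a constant correction viable.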
After applying a constant geometric correction derived from average cluster-size statistics, the final localization error was reduced to 4.05 millimeters under static testing and 3.45 millimeters under dynamic perturbation, a millimeter-level accuracy suitable for collision-free grasp planning. Field experiments further validated the system-level benefits: a pose-alignment harvesting strategy based on the refined depth estimates increased the cluster-level grasp success rate to 75% and reduced the cycle time to approximately 13 seconds, an improvement of more than 60% over the unaligned control strategy.

In conclusion, this study presents a unified lightweight perception framework that integrates efficient object detection with stable depth estimation for clustered crops. The design can be deployed on industrial PCs without a GPU, enabling practical automation in hilly and mountainous orchards; the implementation uses ONNX-exported models and CPU-only inference to match cost-sensitive field platforms, and the complete pipeline outputs both 2D detections and camera-frame 3D coordinates for each frame. The ablation studies further confirmed that the attention-enhanced head improved recall on partially occluded clusters, whereas the backbone hetero-scale attention primarily reduced leaf-induced false positives, with negligible change in parameter count or throughput. Although the method performs robustly in orchard environments, challenges remain regarding branch elasticity, real-time closed-loop control, and damage-aware gripping. Future work will incorporate tactile sensing, adaptive force control, and multi-modal fusion to enhance harvesting reliability under dynamic disturbances. Overall, the proposed framework provides a practical foundation for scalable robotic harvesting of clustered crops in unstructured agricultural environments.
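As a closing sketch, the constant geometric correction can be folded into the estimator as a fixed additive offset; using the reported approximately 1.52-centimeter front-surface bias as that constant is an assumption here, since the paper derives the value from average cluster-size statistics.

```python
# Assumed constant offset (meters): the reported ~1.52 cm front-surface
# bias, standing in for the paper's cluster-size-derived correction.
CLUSTER_BIAS_CORRECTION_M = 0.0152

def corrected_cluster_depth(depth_map, box):
    """Front-surface depth from nearest_multipoint_depth (sketched above),
    shifted toward the cluster centroid by a fixed geometric correction."""
    d = nearest_multipoint_depth(depth_map, box)
    return None if d is None else d + CLUSTER_BIAS_CORRECTION_M
```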