Abstract:
Red Sichuan pepper (Zanthoxylum bungeanum) is one of the most important economic crops in hilly and mountain orchards, and its harvesting requires bruise-free grasping. However, challenges remain due to dense canopy structures, irregular illumination, and occlusion by leaves and branches. Conventional vision systems can hardly maintain both high detection precision and real-time performance on embedded hardware, particularly for densely distributed small clusters that are easily confused with specular leaves and branch textures. In this study, a lightweight framework was developed for visual perception and spatial localization, enabling precise manipulation of clustered targets during mechanical harvesting in unstructured field environments. A dataset of 999 field images with 5,094 annotated instances was constructed to represent typical orchard scenes under strong viewpoint and illumination variability. Based on the YOLOv11n baseline, a customized lightweight detector (named PepNet v1) was designed, incorporating multi-scale convolutional attention in the detection head and hetero-scale attention in the backbone. These components were tailored to enhance multi-scale feature aggregation for dense clusters and to suppress background interference with minimal computational overhead. The model was trained with cosine-decayed learning rate scheduling, label smoothing, and L2 regularization, together with moderate color and geometric augmentation. Additionally, a nearest multi-point depth estimation method was introduced to cope with occlusion, empty-space interference, and inter-berry gaps, in contrast to conventional single-pixel depth estimation.
The depth was computed from the minimum-range set of surrounding valid pixels rather than from a single center pixel, providing a bounded error trend suitable for parameterized correction. Experiments were conducted under both laboratory and field conditions using an Intel RealSense D435i depth camera and an industrial computer running ONNX Runtime exclusively on a central processing unit. PepNet v1 achieved a mean average precision (mAP@0.5) of 0.817 on the test set while maintaining a computational cost of only 6.5 GFLOPs, comparable to the YOLOv11n baseline, and an average inference speed of 45.52 frames per second on 640×640 pixel inputs, thereby balancing detection confidence with actuation safety. Ablation experiments demonstrated that the multi-scale convolutional attention improved fine-grained cluster representation, particularly for small targets at long distances, while the hetero-scale attention reduced false activations in complex backgrounds. Both mechanisms increased precision without compromising inference latency, and PepNet v1 achieved the most balanced performance among lightweight models. In the spatial localization task, the conventional center-pixel approach exhibited severe instability when depth information was missing or contaminated by background elements, with errors exceeding 80 cm in extreme cases. In contrast, the nearest multi-point approach consistently underestimated the true depth by approximately 1.52 cm but produced significantly lower variance under both static and motion-disturbed conditions. A constant geometric correction was therefore derived from the average cluster size, reducing the final localization error to 0.41 and 0.35 cm under static testing and dynamic perturbation, respectively, indicating millimeter-level accuracy suitable for collision-free grasp planning.
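The nearest multi-point estimator with its constant correction can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name, the choice of k, and the numeric correction value are assumptions.

```python
import numpy as np

def nearest_multipoint_depth(depth_cm, box, k=10, correction_cm=1.5):
    """Depth of a detected cluster from the k nearest valid pixels in its box.

    depth_cm: HxW depth map in centimeters (0 marks missing depth);
    box: (x1, y1, x2, y2) detection in pixel coordinates;
    k and correction_cm are illustrative values, not the reported ones.
    """
    x1, y1, x2, y2 = box
    patch = depth_cm[y1:y2, x1:x2]
    valid = patch[patch > 0]                   # discard holes and empty-space returns
    if valid.size == 0:
        return None                            # no usable depth in this box
    k = min(k, valid.size)
    nearest = np.partition(valid, k - 1)[:k]   # k smallest ranges (cluster front)
    # Averaging the nearest ranges biases the estimate toward the cluster's
    # front surface, so a constant geometric offset derived from the average
    # cluster size is added back.
    return float(nearest.mean()) + correction_cm
```

Because the nearest-range average tracks the front of the cluster, its bias is roughly constant (about 1.52 cm in the reported experiments), which is what makes a single additive correction effective.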
Field experiments further validated the system-level benefits: pose-aligned harvesting with the refined depth estimation increased the grasp success rate to 75% at the cluster level and reduced the cycle time by approximately 13 s, an improvement of more than 60% over the unaligned control. In conclusion, a unified lightweight perception framework was presented that integrates efficient object detection and stable depth estimation for clustered crops. ONNX-exported models and CPU-only inference matched cost-sensitive field platforms, and the model was deployable on industrial PCs without GPU dependence in mountain orchards. The pipeline outputs both 2D detections and camera-frame 3D coordinates at each frame. Ablation tests confirmed that the attention-enhanced head improved recall on partially occluded clusters, whereas the backbone scale attention reduced leaf-induced false positives, with a negligible change in parameter count and throughput. Branch elasticity, real-time closed-loop control, and damage-aware gripping remain open challenges in orchard environments; tactile sensing, adaptive force control, and multi-modal fusion can be incorporated to enhance harvesting reliability under dynamic disturbances. Overall, the framework provides a practical foundation for scalable robotic harvesting of clustered crops in unstructured agricultural environments.
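As a minimal sketch of the CPU-only deployment path, the snippet below prepares a frame for a 640×640 ONNX detector. The nearest-neighbor resize and normalization are simplifying assumptions (a real pipeline would typically letterbox to preserve aspect ratio), and the model filename is hypothetical.

```python
import numpy as np

def preprocess(frame_bgr, size=640):
    """Resize a BGR frame to a (1, 3, size, size) float32 tensor in [0, 1]."""
    h, w = frame_bgr.shape[:2]
    # Plain nearest-neighbor resize, kept dependency-free for illustration.
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    resized = frame_bgr[ys][:, xs]
    x = resized[..., ::-1].astype(np.float32) / 255.0        # BGR -> RGB, scale
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])  # HWC -> NCHW

# CPU-only inference with ONNX Runtime (no GPU provider requested);
# "pepnet_v1.onnx" is a placeholder path:
#   import onnxruntime as ort
#   session = ort.InferenceSession("pepnet_v1.onnx",
#                                  providers=["CPUExecutionProvider"])
#   outputs = session.run(None, {session.get_inputs()[0].name: preprocess(frame)})
```

Restricting the providers list to `CPUExecutionProvider` is what keeps the exported model runnable on GPU-free industrial PCs.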