Design and evaluation of a vision-based detection and multi-point localization system for harvesting red Sichuan pepper clusters

  • Abstract: To address the difficulty of harvesting pepper clusters that are small, densely distributed, and easily damaged in unstructured orchards, this study collected and annotated 999 field images to build a red Sichuan pepper cluster detection dataset containing 5 094 instances. Based on the YOLO11n baseline, heterogeneous-scale attention and multi-scale convolution modules were introduced into the backbone and head, respectively, to construct the lightweight model PepNet v1 and improve recognition of small pepper-cluster targets. Tests showed that PepNet v1 reached a mean average precision of 0.817 and a precision of 0.808, outperforming YOLO11n, at a computational cost of only 6.5 GFLOPs (about 6.5 billion floating-point operations). To address the large errors of center-pixel depth estimation under depth holes and occlusion, a depth estimation method based on the statistical mean of the nearest N points was proposed, reducing static and dynamic errors to 1.30 and 1.75 cm, respectively, and further to 0.41 and 0.35 cm after correction. Field experiments showed that, with pose-alignment optimization, the harvesting success rate reached 75% and efficiency improved by more than 60%. The method can provide a reliable reference for the automatic harvesting of red Sichuan pepper clusters.

     

    Abstract: Red Sichuan pepper (Zanthoxylum bungeanum) is one of the most important economic crops in hilly and mountain orchards, and its harvesting requires bruise-free grasping. However, challenges remain due to dense canopy structures, irregular illumination, and occlusion by leaves and branches. Conventional vision systems can hardly maintain both high detection precision and real-time performance on embedded hardware, particularly for densely distributed small clusters that are easily confused with specular leaves and branch textures. In this study, a lightweight framework was developed for visual perception and spatial localization, enabling precise manipulation of clustered targets during mechanical harvesting in unstructured field environments. A dataset of 999 field images with 5 094 annotated instances was constructed to represent typical orchard scenes under strong viewpoint and illumination variability. Based on the YOLOv11n baseline, a customized lightweight detector (named PepNet v1) was designed, incorporating multi-scale convolutional attention in the detection head and hetero-scale attention in the backbone. These components were tailored to enhance multi-scale feature aggregation for dense clusters and to suppress background interference with minimal computational overhead. The model was trained with cosine-decayed learning rate scheduling, label smoothing, and L2 regularization, together with moderate color and geometric augmentation. Additionally, a nearest multi-point depth estimation method was introduced to cope with occlusion, empty-space interference, and inter-berry gaps, in contrast to conventional single-pixel depth estimation.
The depth was computed from the minimum-range set of surrounding valid pixels rather than from a single center pixel, providing a bounded error trend suitable for parameterized correction. Experiments were conducted under both laboratory and field conditions using an Intel RealSense D435i depth camera and an industrial computer running ONNX Runtime exclusively on the central processing unit. PepNet v1 achieved a mean average precision at a 0.5 threshold (mAP@0.5) of 0.817 on the test set, outperforming the YOLOv11n baseline, while requiring only 6.5 GFLOPs and sustaining an average inference speed of 45.52 frames per second on 640×640 pixel input video, thereby balancing detection confidence with actuation safety. Ablation experiments demonstrated that the multi-scale convolutional attention improved fine-grained cluster representation, particularly for small targets at long distances, while the hetero-scale attention reduced false activations in complex backgrounds; the two mechanisms together increased precision without compromising inference latency, yielding the most balanced performance among lightweight models. In the spatial localization task, the conventional center-pixel approach exhibited severe instability when depth information was missing or contaminated by background elements, with errors exceeding 80 cm in extreme cases. In contrast, the nearest multi-point approach consistently underestimated the true depth by approximately 1.52 cm but produced significantly lower variance under both static and motion-disturbed conditions. A constant geometric correction was therefore derived from the average cluster size, reducing the final localization error to 0.41 cm under static testing and 0.35 cm under dynamic perturbation, indicating millimeter-level accuracy suitable for collision-free grasp planning.
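The nearest multi-point strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the choice of N, and the use of the reported ~1.52 cm mean underestimate as the constant correction are all illustrative assumptions.

```python
import numpy as np

def cluster_depth(depth_map, box, n_points=25, correction_cm=1.52):
    """Estimate cluster depth from the N nearest valid pixels inside the
    detection box, instead of the single center pixel.

    depth_map : 2D array of per-pixel depth in cm (0 = hole / invalid).
    box       : (x1, y1, x2, y2) detection box in pixel coordinates.
    n_points and correction_cm are hypothetical values chosen for
    illustration of the paper's described approach.
    """
    x1, y1, x2, y2 = box
    roi = depth_map[y1:y2, x1:x2]
    valid = roi[roi > 0]                       # discard depth holes
    if valid.size == 0:
        return None                            # no usable depth in the box
    n = min(n_points, valid.size)
    nearest = np.partition(valid, n - 1)[:n]   # N smallest ranges
    raw = float(nearest.mean())                # statistical mean of nearest N
    # Nearest points lie on the cluster's front surface, so the raw value
    # underestimates the cluster center; add a constant offset derived
    # from the average cluster size.
    return raw + correction_cm
```

Averaging only the minimum-range pixels bounds the error from the front-surface bias, which is exactly what makes a single constant correction effective.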
Field experiments further validated the system-level benefits: pose-aligned harvesting with the refined depth estimation increased the cluster-level grasp success rate to 75% and reduced the cycle time by approximately 13 s, an improvement of more than 60% over the unaligned control. In conclusion, a unified lightweight perception framework was presented that integrates efficient object detection with stable depth estimation for clustered crops. The ONNX-exported model and CPU-only inference suit cost-sensitive field platforms, and the model was deployed on industrial PCs in the mountain orchard without GPU dependence. The pipeline outputs both 2D detections and camera-frame 3D coordinates at each frame. Ablation tests confirmed that the attention-enhanced head improved recall on partially occluded clusters, whereas the backbone scale attention reduced leaf-induced false positives, with negligible changes in parameter count and throughput. Branch elasticity, real-time closed-loop control, and damage-aware gripping remain open challenges in orchard environments; tactile sensing, adaptive force control, and multi-modal fusion can be incorporated to enhance harvesting reliability under dynamic disturbances. Overall, the framework can provide a practical foundation for the scalable robotic harvesting of clustered crops in unstructured agricultural environments.
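As a minimal sketch of how a 2D detection plus its estimated depth could be converted into the camera-frame 3D coordinate the pipeline reports, assuming the standard pinhole model (the intrinsic values below are placeholders, not the D435i's actual calibration; the RealSense SDK's `rs2_deproject_pixel_to_point` performs the equivalent computation):

```python
import numpy as np

def pixel_to_camera_xyz(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with known depth into camera-frame 3D
    coordinates via the pinhole model. fx, fy are focal lengths in pixels;
    (cx, cy) is the principal point. Units of the result match `depth`.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Illustrative intrinsics only; real values come from camera calibration.
target = pixel_to_camera_xyz(u=920, v=240, depth=50.0,
                             fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

Feeding the corrected multi-point depth (rather than the raw center-pixel value) into this back-projection is what propagates the millimeter-level depth accuracy into the 3D grasp target.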

     
