Abstract:
Accurate recognition of the grape picking point is essential for intelligent, efficient, and non-destructive harvesting by grape-picking robots. However, robust 3D localization is often undermined by factors such as occlusion, irregular lighting, and the complex spatial distribution of grape clusters in unstructured orchard environments, so enhancing the overall reliability of harvesting decisions is highly desirable. In this study, a dual-modal visual perception and cognition framework was proposed that integrates 3D point clouds and 2D RGB images for robust and precise picking-point localization under diverse orchard conditions. First, the 3D semantic scene was constructed. Point Transformer V2 (PTV2), a point-cloud processing model incorporating grouped vector attention and relative positional encoding, was adopted to capture both local geometric structures and long-range contextual dependencies. Point clouds acquired from a depth camera were semantically segmented into classes such as grapes, peduncles, and branches, forming the structural foundation for subsequent geometric analysis. PTV2 achieved high segmentation accuracy, with a mean Intersection over Union (mIoU) of 89.83%; the IoU values for the peduncle and branch classes were 78.55% and 84.20%, respectively, demonstrating strong recognition in real orchard scenarios. A 3D Grape Picking Point Localization Algorithm (3D GPPLA) was then proposed to determine picking points within complex grape-cluster arrangements. A two-stage clustering procedure combining DBSCAN and K-Means partitioned multi-cluster grape point clouds into independent candidate clusters, after which a morphology validation step determined whether each cluster corresponded to an individual grape bunch.
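The two-stage clustering idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `eps`, `min_samples`, size threshold, and the bounding-box morphology check are all assumed placeholders standing in for the unspecified parameters and validation criteria.

```python
# Hedged sketch of the two-stage DBSCAN + K-Means clustering step.
# All numeric thresholds and the morphology test are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def split_grape_clusters(points, eps=0.03, min_samples=20, max_k=4):
    """Partition grape-class points (N x 3, meters) into candidate bunches.

    Stage 1: DBSCAN separates spatially disjoint clusters.
    Stage 2: any DBSCAN cluster failing a simple size-based morphology
    check (a hypothetical stand-in) is further split with K-Means.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    candidates = []
    for lab in set(labels) - {-1}:              # -1 marks DBSCAN noise
        cluster = points[labels == lab]
        extent = cluster.max(axis=0) - cluster.min(axis=0)
        if extent.max() < 0.20:                 # plausible single-bunch size (assumed)
            candidates.append(cluster)
        else:                                   # likely merged bunches: refine
            k = min(max_k, max(2, int(round(extent.max() / 0.15))))
            sub = KMeans(n_clusters=k, n_init=10).fit_predict(cluster)
            candidates.extend(cluster[sub == i] for i in range(k))
    return candidates
```

Each returned candidate cloud would then be passed to the morphology validation described above.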
The recursion depth was restricted to prevent over-segmentation and preserve computational efficiency; if no valid partition was obtained within the allowed depth, the system rolled back to the previous clustering state for stability. Once a single grape bunch was identified, 3D GPPLA estimated the picking point from the spatial relationship between the grape centroid and the peduncle region. Specifically, the minimum bounding box around the grape-peduncle subset was computed to determine the optimal picking direction, and peduncle proximity and accessibility were evaluated to minimize damage during separation and ensure consistent harvesting performance. A 2D fallback strategy was further introduced to enhance robustness in cases where the 3D approach failed due to severe occlusion, missing depth data, or segmentation noise. When such a failure was detected, the system switched to inference on the 2D RGB image. Leveraging SegFormer, a state-of-the-art transformer-based semantic segmentation network, the image was partitioned into high-fidelity semantic regions, including grape and peduncle, among others. The 2D GPPLA algorithm then computed picking points in image space using shape heuristics and spatial priors, and the results were projected back into the 3D point cloud through depth-aligned pixel mapping. This fallback mechanism enhanced resilience in cluttered and partially observable environments, while the richer texture and color cues of RGB images compensated for limitations in point-cloud resolution and sensor noise. A custom dataset comprising 1,847 grape clusters was collected under natural orchard conditions, on which 3D GPPLA achieved a picking-point localization success rate of 89.11%.
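The depth-aligned projection of a 2D picking point back into 3D can be sketched with a standard pinhole camera model. The intrinsics below (`fx`, `fy`, `cx`, `cy`) are illustrative placeholders, not values from the paper, and real depth maps would additionally require handling of invalid (zero) depth readings.

```python
# Hedged sketch of 2D -> 3D back-projection via the pinhole camera model.
# Intrinsic parameters here are assumed placeholders.
import numpy as np

def pixel_to_camera(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates.

    fx, fy: focal lengths in pixels; cx, cy: principal point.
    Returns a 3D point [x, y, z] in the camera frame.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])
```

For example, a picking point detected at the principal point maps straight onto the optical axis at the measured depth.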
In particular, success rates of 98.81% in single-cluster scenarios and 80.95% in multi-cluster arrangements highlighted its adaptability to varying levels of structural complexity. Combined with the 2D fallback strategy, the framework achieved high overall reliability and significantly reduced failure cases in cluttered and occluded scenarios. By integrating advanced 3D semantic segmentation, adaptive multi-stage clustering, and cross-modal compensation, the framework achieved accurate, stable, and efficient picking-point localization in unstructured vineyard environments. These findings provide a solid technical contribution toward the practical deployment of grape-harvesting robots in smart agriculture.
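The bounding-box step used for picking-direction estimation can be illustrated with a PCA-based oriented bounding box. This is a generic sketch under assumed conventions, not the paper's exact minimum-bounding-box computation: the longest principal axis of the grape-peduncle subset is taken as a proxy for the bunch's hanging direction.

```python
# Hedged sketch: picking direction from the principal axes of the
# grape-peduncle subset (PCA-based oriented bounding box; assumption,
# not the paper's exact method).
import numpy as np

def picking_direction(points):
    """Return the unit vector of the longest oriented-bounding-box axis.

    points: N x 3 array of the grape + peduncle subset. The dominant
    principal axis approximates the bunch's hanging direction, so a cut
    near the peduncle end, perpendicular to this axis, is one plausible
    separation strategy.
    """
    centered = points - points.mean(axis=0)
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    axis = eigvecs[:, -1]                    # direction of largest variance
    return axis / np.linalg.norm(axis)
```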