Abstract:
The "pile picking" method for harvesting green Sichuan pepper is a targeted pruning technique in which fruit-bearing branches are cut selectively while short stumps of a predetermined length are left on the plant. Retaining part of the branch structure supports future growth, improves harvesting efficiency, and maintains the overall health of the plant. To enable a Sichuan pepper harvesting robot to recognize branches and determine short-stump cutting points in complex field environments with dense foliage and varying illumination, this study proposes a method for localizing short-stump cutting points on the main branches of green Sichuan pepper based on a U-Net deep learning network and an RGB-D camera. The method combines semantic segmentation for branch identification with depth information for spatial localization, forming a complete pipeline from image acquisition to cutting-point coordinate determination. First, the traditional U-Net model is improved in two ways. Its backbone network is replaced with ResNet50 embedded with a Coordinate Attention (CA) mechanism, which strengthens the model's ability to capture fine-grained spatial features and thereby improves the boundary completeness and segmentation precision of branch structures. In addition, a Squeeze-and-Excitation (SE) attention mechanism is added at the feature splicing stage of the U-Net model to adaptively recalibrate channel-wise feature responses. The resulting segmentation model for the main branches and trunk of Sichuan pepper distinguishes the target structures from complex backgrounds that include leaves, fruits, and interfering branches.
Then, the segmented images of the main branches and trunk are binarized, and the centerline of each main branch is extracted with the Zhang-Suen thinning algorithm, combining depth information from the RGB-D camera with OpenCV image processing algorithms. Pixel lengths in the pixel coordinate system are converted to physical lengths in the physical coordinate system using the camera intrinsic parameters (focal length and pixel size), and then to actual lengths in the world coordinate system using the depth measurements from the RGB-D camera. This establishes a length mapping between the world coordinate system and the pixel coordinate system, which allows metric-scale measurement of branch dimensions in three-dimensional space. The predefined short-stump length of 40 mm is mapped from the world coordinate system to the corresponding pixel length in the RGB image, and the cutting point on each main branch is then located in the image plane. Experimental results show that the improved U-Net model outperforms other semantic segmentation models such as DeepLabV3+ and PSPNet, achieving a Mean Intersection over Union (MIoU) of 87.58%, a mean Pixel Accuracy (mPA) of 93.76%, and a Recall of 96.24%. The success rates for identifying and locating the cutting points were 90.81% under direct light, 84.88% under backlight, and 80.52% under cloudy conditions.
In the cutting point localization experiment, the localization success rate was 90%, and identifying a single branch took 1.93 s on average. These results can provide technical support for "pile picking" harvesting by green Sichuan pepper picking robots.
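The length mapping summarized above can be illustrated with a minimal Python sketch. It is not the implementation used in this study. It assumes an ideal pinhole camera, a branch segment lying roughly parallel to the image plane at a single measured depth, and a centerline already available as an ordered list of pixel coordinates starting at the branch base. The function names, parameter names, and numeric values are hypothetical and chosen only for illustration.

```python
import math


def stump_length_in_pixels(length_mm, depth_mm, focal_mm, pixel_size_mm):
    """Map a physical length to a pixel length (pinhole model, assumed).

    Assumes the segment is roughly parallel to the image plane at depth_mm.
    """
    if depth_mm <= 0 or pixel_size_mm <= 0:
        raise ValueError("depth and pixel size must be positive")
    focal_px = focal_mm / pixel_size_mm  # focal length expressed in pixels
    return focal_px * length_mm / depth_mm


def pixels_to_mm(length_px, depth_mm, focal_mm, pixel_size_mm):
    """Inverse mapping: pixel length back to physical length in millimetres."""
    return length_px * pixel_size_mm * depth_mm / focal_mm


def cutting_point_on_centerline(points, target_px):
    """Walk an ordered centerline from the branch base and return the first
    point whose accumulated path length reaches target_px.

    points: list of (x, y) pixel coordinates, base first (hypothetical input).
    Returns None if the centerline is shorter than target_px.
    """
    travelled = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        travelled += math.hypot(x1 - x0, y1 - y0)
        if travelled >= target_px:
            return (x1, y1)
    return None


# Illustrative values only: 2.0 mm focal length, 0.004 mm pixel size,
# branch measured at 500 mm depth, 40 mm stump length.
stump_px = stump_length_in_pixels(40.0, 500.0, 2.0, 0.004)
centerline = [(x, 0) for x in range(101)]  # synthetic straight centerline
cut_point = cutting_point_on_centerline(centerline, 40.0)
```

In practice the depth varies along a branch and branches are tilted relative to the image plane, so a per-point back-projection of the centerline into three-dimensional coordinates would likely be more accurate than the single-depth scaling sketched here.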