Transformer-based semantic segmentation of grapes in sparse point clouds

谢元澄; 高宇阳; 李添天; 戴倩; 姜海燕

doi:10.11975/j.issn.1002-6819.202403220

Semantic segmentation of sparse point clouds in grapes using transformer-based approach

Abstract

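Abstract: In intelligent agricultural harvesting, point cloud semantic segmentation is a necessary prerequisite for unmanned picking, fruit localization, and accurate cutting. Most existing semantic segmentation research in agricultural scenes relies on dense point cloud data, which is difficult and costly to acquire, while sparse point cloud data leads to poor segmentation results. To address data sparsity, this study introduces the Transformer multi-head self-attention mechanism into the point-based approach of the PointNet algorithm and constructs a point cloud semantic segmentation method, SP-Transformer. The method partitions the point cloud into multi-level windows so that the attention mechanism focuses on local features within each window, and it establishes a multi-scale fusion strategy of dense and sparse keys to enlarge the receptive field and capture long-range contextual dependencies. It also applies high-level feature embedding at the very start of the attention stage to improve the segmentation of sparse point clouds. On the grape test set, the average accuracy reached 89.9%, and the segmentation accuracy for grapes reached 81.1%. The experimental results show that SP-Transformer maintains good segmentation performance even on low-density point clouds.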

Abstract: Grapes are one of the world's major fruit crops and hold a significant position in agricultural production. Currently, grape cultivation in China is predominantly carried out manually, with limited adoption of mechanization. This reliance on labor-intensive practices results in high costs and low efficiency, which hinders the sustainable development of the grape industry. Accelerating the transition toward mechanization and intelligent automation is therefore crucial to addressing these challenges. In intelligent agricultural harvesting, point cloud semantic segmentation serves as a foundational technology for unmanned harvesting, accurate fruit localization, and precise cutting. However, most existing research on semantic segmentation in agricultural scenarios is based on dense point cloud data, which is difficult and costly to acquire. Sparse point cloud data, on the other hand, often leads to suboptimal segmentation performance, posing a significant barrier to the practical implementation of intelligent harvesting systems. To address the issue of sparse data, this study introduces a point cloud semantic segmentation method called SP-Transformer, which builds upon the PointNet algorithm and incorporates a Transformer multi-head self-attention mechanism. The SP-Transformer method divides the point cloud into multi-level windows, enabling the attention mechanism to focus on local features within each window. It employs a multi-scale fusion strategy that combines dense and sparse keys to expand the receptive field and capture long-range contextual dependencies. Additionally, high-level feature embedding is applied at the initial stage of the attention process to enhance the segmentation performance on sparse point clouds.
Specifically, the three-dimensional space is first partitioned into non-overlapping cubic windows, and each point is assigned to the window that contains it. Each query point attends only to neighboring points within the same cubic window, and independent multi-head self-attention is performed within each window. Because the point cloud is divided into small windows, the receptive field of each query point is inherently limited. To address this limitation, cross-window communication is introduced by shifting the windows by half the window size between two consecutive Transformer blocks, which increases contextual connections and expands the receptive field. For key sampling, the input point cloud is divided into non-overlapping cubic windows of a predefined window size (ws), and for each query point (qi) the set of points (kden) within the same cubic window is identified. Simultaneously, the input point cloud is downsampled to construct a sparse sampling window: the downsampled space is divided into non-overlapping cubic windows of twice the window size (2ws), and the set of points (kspa) belonging to the larger cubic window that contains qi is identified. Finally, the two sampling sets are combined to form the final sampling point set, which is used for further processing. This approach allows the model to capture both local and global features, even in sparse point cloud data.

Experimental results on a grape dataset demonstrate the effectiveness of the SP-Transformer method, which achieves an average accuracy of 89.9%, 4.4 percentage points higher than PointNet++ and 1.5 percentage points higher than PointTransformer. The segmentation accuracy for the grape class reached 81.1%, with an average intersection over union (IoU) of 82.8%.
To further validate the robustness of the method, the point cloud completion algorithm PENet was employed to densify the point cloud, increasing the point cloud density from approximately 60,000 points to around 500,000 points. By comparing the performance of dense and sparse data, the SP-Transformer method was trained and evaluated for both real-time performance and segmentation accuracy. The results indicate that the SP-Transformer method maintains robust segmentation performance even for low-density point clouds, outperforming traditional methods such as PointNet++ in terms of both accuracy and efficiency.

The SP-Transformer method represents a significant advancement in point cloud semantic segmentation, particularly in agricultural applications where sparse point cloud data is prevalent. By leveraging the multi-head self-attention mechanism and multi-scale fusion strategy, the method effectively addresses the challenges posed by sparse data, enabling more accurate and efficient segmentation. This improvement is critical for the development of intelligent harvesting systems, as it enhances the ability to accurately locate and segment fruits, thereby reducing labor costs and increasing operational efficiency.

Moreover, the SP-Transformer method has broader implications beyond agricultural applications. Its ability to handle sparse point cloud data makes it suitable for other domains such as autonomous driving, robotics, and environmental monitoring, where sparse data is often encountered. The method's use of cross-window communication and high-level feature embedding provides a robust framework for capturing both local and global features, making it a versatile tool for various computer vision tasks.

In conclusion, the SP-Transformer method offers a powerful and efficient solution for point cloud semantic segmentation in agricultural intelligent harvesting. By addressing the challenges of sparse data and enhancing the ability to capture long-range contextual dependencies, this method paves the way for more accurate and reliable intelligent harvesting systems. Its success in achieving high segmentation accuracy on sparse point cloud data underscores its potential to revolutionize the field of precision agriculture and contribute to the broader advancement of computer vision technologies.
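
The dense and sparse key sampling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `window_coords` and `fused_keys`, the `stride` parameter, and the use of plain strided subsampling for the downsampling step are assumptions made for illustration, since the abstract does not name the downsampling method. The `shifted` flag offsets each grid by half of its own cell size, which is one possible reading of the half-window shift between consecutive Transformer blocks.

```python
import numpy as np

def window_coords(points, size, origin, shifted=False):
    """Integer cubic-window coordinates of each point for a given window size."""
    # Assumption: a shifted grid is offset by half of its own cell size.
    offset = size / 2.0 if shifted else 0.0
    return np.floor((points - origin + offset) / size).astype(np.int64)

def fused_keys(points, query_idx, ws, stride=4, shifted=False):
    """Return (k_den, k_spa, fused) key indices for one query point.

    k_den: all points in the query's cubic window of size ws.
    k_spa: downsampled points in the query's cubic window of size 2 * ws.
    fused: sorted union of both index sets.
    """
    origin = points.min(axis=0)

    # Dense keys: every point that shares the query's window of size ws.
    dense_cells = window_coords(points, ws, origin, shifted)
    k_den = np.flatnonzero(np.all(dense_cells == dense_cells[query_idx], axis=1))

    # Sparse keys: downsample first, then keep the sampled points that fall
    # into the larger (2 * ws) window containing the query.
    sampled = np.arange(0, len(points), stride)  # assumption: strided subsampling
    sparse_cells = window_coords(points, 2.0 * ws, origin, shifted)
    in_window = np.all(sparse_cells[sampled] == sparse_cells[query_idx], axis=1)
    k_spa = sampled[in_window]

    # Fuse the two key sets; union1d returns sorted, de-duplicated indices.
    return k_den, k_spa, np.union1d(k_den, k_spa)
```

In a full model the fused indices would select the key and value features for the query's multi-head self-attention; here they only show how the larger, sparser window extends the receptive field beyond the dense window.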
