Abstract:
Plant height is a key phenotypic trait that reflects crop growth status, supports yield prediction, and guides precision management. Monitoring wheat plant height therefore holds significant scientific and practical value for field-based precision agriculture and breeding. However, traditional field-based plant height measurements are often labor-intensive, time-consuming, and prone to human error, limiting their scalability and consistency in large-scale agricultural settings.
In this study, we propose an automated wheat height estimation framework that integrates semantic segmentation across all growth stages. The approach uses multi-temporal, multi-view UAV RGB images together with Structure from Motion (SfM)-based 3D reconstruction to generate Digital Surface Models (DSMs) and Digital Terrain Models (DTMs); the Crop Height Model (CHM) is then derived by differencing the DSM and DTM. Meanwhile, an improved SegFormer-based semantic segmentation model is employed to accurately extract wheat canopy regions from field images, effectively eliminating background noise such as soil and other non-vegetation elements. Based on the resulting segmentation masks and the CHM, an automatic height inversion system is established to achieve accurate and efficient estimation of wheat canopy height. Specifically, the masks are used to isolate canopy regions in the CHM, and the 95th percentile of height values within each connected vegetation cluster represents that cluster's canopy height. These values are aggregated at the plot level to derive the average wheat height at each growth stage, enabling reliable phenotypic analysis and temporal monitoring. In the segmentation model's encoder, a parallel structure combining a CNN-based detail branch and a Transformer-based semantic branch enhances the synergistic representation of local texture features and global contextual information. The CNN branch captures subtle edge structures and local texture variations, which are particularly critical during early growth stages when canopies are sparse and fragmented. In contrast, the Transformer branch encodes long-range dependencies and semantic context, enabling robust representation of large-scale canopy structures. In the decoder, progressive upsampling and skip connections are combined with a feature fusion module to integrate multi-scale features effectively, thereby improving the preservation of spatial information and the refinement of boundary details.
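The height-inversion step described above (CHM by DSM−DTM differencing, mask-based isolation of canopy pixels, and a 95th-percentile summary per connected vegetation cluster) can be sketched as follows. The function name, array layout, and toy grid are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy import ndimage

def plot_height_from_chm(dsm, dtm, canopy_mask, percentile=95):
    """Illustrative sketch of percentile-based height inversion.

    dsm, dtm: 2-D elevation arrays on the same grid (surface / terrain).
    canopy_mask: boolean array from the segmentation model (True = wheat).
    Each connected vegetation cluster is summarised by the given
    percentile of its CHM values; cluster heights are then averaged
    to give a plot-level height.
    """
    chm = dsm - dtm                          # Crop Height Model by differencing
    labels, n = ndimage.label(canopy_mask)   # connected vegetation clusters
    heights = [np.percentile(chm[labels == k], percentile)
               for k in range(1, n + 1)]
    return float(np.mean(heights)) if heights else 0.0

# Toy example: flat terrain at 0 m, two canopy patches ~0.5 m and ~0.8 m tall.
dtm = np.zeros((6, 6))
dsm = np.zeros((6, 6))
dsm[0:2, 0:2] = 0.5
dsm[4:6, 4:6] = 0.8
mask = dsm > 0
print(plot_height_from_chm(dsm, dtm, mask))  # ~0.65
```

In practice the DSM/DTM rasters come from the SfM reconstruction and the mask from the segmentation model; the percentile summary makes the estimate robust to isolated high outliers in the CHM.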
To address the challenge of effectively combining heterogeneous features from the dual-branch encoder and the multi-scale decoder, an Aggregation Layer is introduced as a dedicated feature fusion module. This module combines point-wise multiplication and addition to enhance the complementarity of local and global representations, and additionally incorporates convolutional refinement, normalization, and channel recalibration mechanisms to improve the stability and semantic consistency of the fused output. Experimental results demonstrate that the proposed model achieves mean intersection over union (mIoU), mean pixel accuracy (mPA), and pixel accuracy (PA) values of 80.92%, 89.42%, and 90.07%, respectively, outperforming the original SegFormer model. The estimated canopy heights show a strong correlation with field measurements, with a coefficient of determination (R²) of 0.985, a root mean square error (RMSE) of 0.73 cm, and a relative RMSE (rRMSE) of 2.41%. In addition, the method successfully captures the temporal dynamics of wheat growth, revealing consistent height-accumulation trends across stages. These results demonstrate that the proposed method can efficiently and accurately retrieve the spatial distribution and plant height of wheat canopies, providing reliable technical support for dynamic monitoring of winter wheat growth, field phenotyping, and precision agriculture applications.
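One plausible realisation of the Aggregation Layer's fusion logic (point-wise multiplication plus addition, convolutional refinement, normalization, channel recalibration) is sketched below. The NumPy form, the 1×1-conv simplification of the refinement step, the per-pixel channel normalization, and the squeeze-and-excitation-style gating are our assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregation_layer(local_feat, global_feat, w_refine, w_se):
    """Hypothetical sketch of dual-branch feature fusion.

    local_feat, global_feat: (C, H, W) maps from the CNN and Transformer
    branches, assumed already projected to the same shape.
    w_refine: (C, C) weights of a 1x1 convolutional refinement.
    w_se: (C, C) weights of an SE-style channel recalibration.
    """
    # 1. Complementarity: point-wise multiplication plus addition.
    fused = local_feat * global_feat + local_feat + global_feat
    # 2. Convolutional refinement (a 1x1 conv is per-pixel channel mixing).
    C, H, W = fused.shape
    refined = (w_refine @ fused.reshape(C, -1)).reshape(C, H, W)
    # 3. Normalization across channels at each spatial position.
    mu = refined.mean(axis=0, keepdims=True)
    sd = refined.std(axis=0, keepdims=True) + 1e-6
    normed = (refined - mu) / sd
    # 4. Channel recalibration: global average pool -> gate -> rescale.
    pooled = normed.mean(axis=(1, 2))          # (C,)
    gate = sigmoid(w_se @ pooled)              # per-channel weights in (0, 1)
    return normed * gate[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
out = aggregation_layer(rng.normal(size=(C, H, W)),
                        rng.normal(size=(C, H, W)),
                        np.eye(C), np.eye(C))
print(out.shape)  # (4, 8, 8)
```

The multiplicative term emphasises locations where both branches respond, while the additive terms preserve each branch's individual evidence; the gating step then reweights channels using global context.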