Abstract:
Winter wheat plays a crucial role in grain production in China, and accurate yield prediction is of great significance for food security and sustainable agriculture. Remote sensing provides an effective means for large-scale crop yield prediction; however, the mainstream convolutional–recurrent neural network approaches cannot fully capture the global spatial–spectral features and the long-range temporal dependencies in long-term remote sensing imagery, which limits the generalization and spatial stability of winter wheat yield prediction over large, heterogeneous production regions. In this study, an accurate model, named the global–local spatiotemporal feature extraction network (GSTFEN), was proposed to predict winter wheat yield through global–local spatiotemporal feature extraction from remote sensing imagery. A dual-branch architecture combining convolutional neural networks (CNNs) and vision transformers (ViTs) was adopted to extract local texture and spectral response features together with global spatial–spectral information from multi-temporal MODIS imagery at the county scale. A coupled attention fusion module (CAFM) was introduced to adaptively integrate the spatial–spectral information through bidirectional interactions between the global and local features, so that the complementary information from the CNN and ViT branches was jointly exploited. A Transformer-based temporal encoder was further employed to capture the long-range temporal dependencies across the crop growth cycle, because the growth stages are critical for final yield formation. By jointly modeling the spatial, spectral, and temporal dynamics of winter wheat growth, GSTFEN improves yield prediction accuracy and provides a unified framework for spatiotemporal feature learning in large-scale yield prediction.
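The bidirectional global–local interaction described above can be illustrated with a minimal sketch. The shapes, the averaging rule, and the function name `coupled_attention_fusion` are assumptions for illustration only; the paper's CAFM is a learned module, whereas this fixed cross-attention computation merely shows the idea of each branch's features attending to the other's.

```python
# Illustrative NumPy sketch of bidirectional cross-attention fusion:
# local (CNN-branch) tokens attend to global (ViT-branch) tokens and
# vice versa, and the two enriched outputs are averaged. This is an
# assumption-laden toy, not the paper's actual CAFM implementation.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coupled_attention_fusion(f_global, f_local):
    """Fuse global and local features of shape (n_tokens, dim)."""
    d = f_global.shape[-1]
    # Local queries attend to global keys/values.
    attn_lg = softmax(f_local @ f_global.T / np.sqrt(d))
    local_enriched = attn_lg @ f_global
    # Global queries attend to local keys/values.
    attn_gl = softmax(f_global @ f_local.T / np.sqrt(d))
    global_enriched = attn_gl @ f_local
    # Simple symmetric combination of the two directions.
    return 0.5 * (local_enriched + global_enriched)

rng = np.random.default_rng(0)
fused = coupled_attention_fusion(rng.normal(size=(16, 64)),
                                 rng.normal(size=(16, 64)))
print(fused.shape)  # → (16, 64)
```

In the sketch the fused output keeps the token layout of the inputs, so it can be passed unchanged to a downstream temporal encoder; in a learned module the fixed 0.5/0.5 average would typically be replaced by trainable projection and gating weights.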
The study area covered the major winter wheat-producing regions in China. A county-level yield prediction dataset was constructed by integrating winter wheat yield records from 2003 to 2022 with MODIS imagery. On the 2019–2022 test set, GSTFEN achieved annual average values of the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) of 0.591 t/hm², 0.475 t/hm², and 0.848, respectively, outperforming the baseline models by 7%–40%. Scatter plot analysis showed that the GSTFEN predictions closely followed the 1:1 line, with most counties exhibiting absolute errors below 0.5 t/hm², in clear contrast to the conventional models, which showed large deviations in the high- and low-yield regions. The spatial error maps indicate that the larger residuals were concentrated mainly in the northwestern and southwestern areas, characterized by fragmented fields and mixed pixels, whereas the major producing regions, such as the Huang–Huai–Hai Plain, exhibited low prediction errors and high spatial consistency. Ablation experiments confirmed that the dual CNN–ViT architecture, the CAFM, and the Transformer-based temporal encoder each contributed substantially to the accuracy and stability. Moreover, the annual average RMSE for the sowing-to-heading stage differed by only 2.31% from that for the sowing-to-maturity stage, indicating that GSTFEN can effectively predict large-scale winter wheat yield approximately one month before harvest while maintaining county-scale prediction accuracy early in the season. These findings provide a strong reference for extracting global–local spatiotemporal features for yield prediction of cereal crops.