Peanut yield prediction based on polynomial regression and stacked model

漆海霞; 黄荟良; 罗锡文; 黄世淳; 胡炼

doi:10.11975/j.issn.1002-6819.202407244

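Abstract: To support scientific management of agricultural activities and improve the accuracy of peanut yield prediction, this study addresses two problems: existing work mostly relies on a single model, which struggles to capture the complex nonlinear relationships between meteorological factors and yield, and traditional trend decomposition methods (such as the moving average and high-pass filtering) fit long-term trends inadequately. Taking southwestern Guangdong Province as the study area, a peanut yield prediction model based on polynomial regression and a stacked model was constructed. Using meteorological data (five factors: air temperature, precipitation, sunshine, wind speed, and relative humidity) and yield data for 16 regions of southwestern Guangdong from 2000 to 2023, polynomial regression was first used to fit the trend yield, which represents the long-term influence of scientific and technological progress and the level of agricultural development on yield. Second, principal component analysis was applied to the normalized meteorological data for dimensionality reduction, removing redundancy and extracting the first 12 principal components, whose cumulative contribution rate reached 90%. Finally, a stacked model was built with K-nearest neighbors, random forest, and gradient boosting regression as base learners and Lasso regression as the meta-learner; combined with a cross-validation strategy, it integrates the strengths of multiple algorithms to resolve the nonlinear relationship between meteorological factors and meteorological yield. The results showed that the model had a mean absolute percentage error of 2.09%, a root mean square error of 78.55 kg/hm², and a coefficient of determination R² of 0.96; compared with combinations of polynomial regression and a single machine learning method, the mean absolute percentage error was reduced by 0.22 to 0.68 percentage points. Yield prediction experiments built on meteorological data from different months of the peanut growth period showed that peanut yield can be predicted accurately as early as the vegetative growth stage, bringing the prediction forward to two months before harvest. In validation for 2020 to 2023, the mean of the model's mean absolute percentage error was 4.62%, indicating that it remains stable under the climatic conditions of different years. By fusing the trend with the dynamic influence of weather, the proposed model combines high accuracy with early prediction capability and offers a useful reference for building yield prediction models for other crops.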
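The three steps described above (polynomial trend fitting, PCA on normalized weather data, and a stacked regressor) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes NumPy and scikit-learn are available, uses synthetic data in place of the real records, treats "normalization" as min-max scaling, and the polynomial degree, hyperparameters, and helper names (`split_trend`, `build_weather_model`) are hypothetical choices that the abstract does not specify.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def split_trend(years, yields, degree=2):
    """Fit a polynomial trend; return (trend yield, meteorological yield)."""
    t = np.asarray(years, dtype=float)
    t = t - t.mean()  # centre the years to keep the fit well conditioned
    y = np.asarray(yields, dtype=float)
    coeffs = np.polyfit(t, y, deg=degree)  # degree is an assumed value
    trend = np.polyval(coeffs, t)
    return trend, y - trend  # the residual is taken as the meteorological yield


def build_weather_model(random_state=0):
    """Min-max scaling -> PCA keeping 90% of the variance -> stacked regressor."""
    base_learners = [
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=random_state)),
        ("gbr", GradientBoostingRegressor(random_state=random_state)),
    ]
    stack = StackingRegressor(
        estimators=base_learners,
        final_estimator=Lasso(alpha=0.1),  # meta-learner; alpha is an assumed value
        cv=5,  # out-of-fold base predictions are used to train the meta-learner
    )
    return make_pipeline(
        MinMaxScaler(),
        PCA(n_components=0.90, svd_solver="full"),  # keep 90% cumulative variance
        stack,
    )


# Synthetic stand-in data: 24 years and 20 hypothetical weather features.
rng = np.random.default_rng(0)
years = np.arange(2000, 2024)
weather = rng.normal(size=(years.size, 20))
yields = (2500 + 30 * (years - 2000) + 40 * weather[:, 0]
          + rng.normal(scale=10, size=years.size))

trend, met_yield = split_trend(years, yields)
model = build_weather_model().fit(weather, met_yield)
predicted_yield = trend + model.predict(weather)  # recombine trend and weather parts
```

Passing a fraction to `n_components` lets PCA choose how many components are needed to reach the variance threshold, so the count of retained components (12 in the study) falls out of the data rather than being fixed in advance.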

Abstract: Current agricultural yield forecasting relies mainly on a single model trained on all available features, an approach that struggles to capture the complex nonlinear relationships between meteorological factors and yield. Traditional trend decomposition methods (e.g., moving average, high-pass filtering) also fit long-term trends poorly. In this study, an integrated model was developed to forecast peanut yield in southwestern Guangdong Province, China. Meteorological data (temperature, precipitation, sunshine duration, wind speed, and relative humidity) and yield records were collected for 16 regions from 2000 to 2023. Polynomial regression and stacking ensemble learning were integrated in three steps: 1) polynomial regression was used to fit the long-term yield trend, quantifying the impact of technological advancement and agricultural management on yield; 2) principal component analysis (PCA) was applied to the normalized meteorological data to reduce dimensionality, extracting the first 12 principal components with a cumulative variance contribution of 90%; 3) a stacked generalization framework was built with K-nearest neighbors (KNN), random forest (RF), and gradient boosting regression (GBR) as base learners and Lasso regression as the meta-learner, combined with cross-validation, to model the nonlinear relationship between meteorological factors and meteorological yield. The model performed well, reducing the mean absolute percentage error (MAPE) by 0.22~0.68 percentage points compared with combinations of polynomial regression and a single machine learning method. Specifically, polynomial regression combined with the stacked model achieved the lowest MAPE (2.09%), mean absolute error (MAE, 57.10 kg/hm²), and root mean square error (RMSE, 78.55 kg/hm²), compared with polynomial regression combined with KNN (MAPE: 2.70%), RF (MAPE: 2.31%), or GBR (MAPE: 2.77%).
In addition, the R² of the combined model reached 0.96, indicating that it explains 96% of the variance in the actual yield data and confirming its high prediction accuracy. Early-forecast tests using only pre-August meteorological inputs (two months before harvest) also maintained high accuracy (R²=0.94), with a MAPE of 2.91% and an MAE of 71.88 kg/hm², showing that the model is effective for early yield forecasting. The model was further validated for 2020 to 2023, where its robustness was confirmed by an average MAPE of 4.62% across the regions. However, forecast accuracy varied by region: the MAPE ranged from 0.25% in Yunfu to 8.15% in Zhanjiang, suggesting a strong influence of regional heterogeneity and non-meteorological factors such as soil properties and farming practices. Different trend decomposition methods were also compared, including moving average, exponential smoothing, high-pass filtering, and polynomial regression. Polynomial regression outperformed the others: it accurately captured the long-term yield trend driven by technological advancement while smoothing out the effects of extreme weather events, giving a more stable and accurate separation of trend yield and meteorological yield. In conclusion, combining polynomial regression for trend analysis with a stacked ensemble model for meteorological yield forecasting offers a robust and accurate framework for peanut yield prediction. Its early forecasts can support agricultural management and market strategy at the regional level. Future research could incorporate additional variables, such as soil properties and satellite data.
Region-specific models that account for local agricultural practices and environmental conditions could also be developed.
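The accuracy measures quoted in the abstract (MAPE, MAE, RMSE, and R²) follow their standard definitions, which can be written out in a few lines of plain Python. The sketch below is illustrative only: the yield values are made up for the example and are not data from the study, and the function names are chosen here for convenience.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error, in the units of the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up yields in kg/hm², for illustration only.
actual = [3000.0, 3200.0, 2800.0, 3500.0]
predicted = [2940.0, 3264.0, 2856.0, 3430.0]

print(f"MAPE = {mape(actual, predicted):.2f}%")
print(f"MAE  = {mae(actual, predicted):.2f} kg/hm²")
print(f"RMSE = {rmse(actual, predicted):.2f} kg/hm²")
print(f"R²   = {r2(actual, predicted):.3f}")
```

MAPE is scale-free, which makes it convenient for comparing regions with different yield levels, while MAE and RMSE stay in kg/hm² and RMSE penalizes large errors more heavily.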
