Abstract:
Early and accurate detection of fall armyworm (Spodoptera frugiperda) infestation in maize fields from unmanned aerial vehicle (UAV) imagery is essential for timely and effective pest management. Reliable detection nevertheless remains challenging for several reasons. First, the feeding marks left by the larvae are small and subtle, making them difficult to identify at high altitudes. Second, object scale varies substantially across flight heights, complicating consistent recognition. Finally, under real-world field conditions, the low contrast between damaged leaf tissue and surrounding healthy foliage further hampers accurate detection, particularly as lighting and environmental conditions vary. Together, these challenges call for computer vision techniques that can robustly identify early signs of infestation across diverse scales and complex backgrounds. This study aimed to develop a robust deep learning model that reliably identifies these subtle infestation traces in multi-scale UAV images, thereby supporting precision agriculture applications. A novel detection architecture, termed Coordinated-BiFPN-P2-YOLO (CBP-YOLO), was developed based on YOLOv8. To counter image degradation caused by coarse ground sampling distance at higher flight altitudes, the Real-Enhanced Super-Resolution Generative Adversarial Network (Real-ESRGAN) was applied as a preprocessing step to reconstruct high-fidelity leaf-damage textures from the original low-resolution inputs. The YOLOv8 backbone was augmented with a Coordinate Attention (CA) mechanism, which jointly captures spatial and channel-wise feature dependencies to improve the localization and discrimination of minute lesions.
Furthermore, the neck component was upgraded with a Bi-directional Feature Pyramid Network (BiFPN) to enable efficient top-down and bottom-up cross-scale feature fusion, minimizing information loss during hierarchical propagation and ensuring consistent representation across scales. In addition, a dedicated detection head operating at a 160×160 spatial resolution with a 64-channel output was added to strengthen sensitivity to small targets. The model was trained and evaluated on a custom UAV dataset collected over maize fields naturally infested by fall armyworm, encompassing diverse lighting conditions and flight heights. Extensive experiments showed that CBP-YOLO achieved peak performance on imagery with a ground sampling distance (GSD) of 0.38 centimeters per pixel. Real-ESRGAN preprocessing significantly alleviated texture blurring and edge ambiguity in low-resolution images, yielding clearer delineation of feeding scars. Ablation studies confirmed the effectiveness of the proposed improvements: on the multi-scale UAV maize leaf dataset, the enhanced model attained an average precision (AP@0.5) of 76.5%, an increase of 3.4 percentage points over the baseline model. These results indicate that the proposed modifications substantially enhance detection capability across leaves of varying scales, underscoring the robustness and practical applicability of the improved approach in aerial inspection scenarios. In comparative evaluations, CBP-YOLO outperformed state-of-the-art detectors, including YOLOv9-medium, YOLOv10-medium, YOLOv11-medium, the Faster Region-based Convolutional Neural Network (Faster R-CNN), and RetinaNet, by margins of 10.1, 7.2, 5.1, 9.3, and 17.9 percentage points in AP@0.5, respectively.
Notably, the model maintained high precision under varying illumination and partial occlusion, demonstrating strong generalization to real-world agricultural environments. The proposed CBP-YOLO framework thus addresses the core challenges of detecting subtle, multi-scale fall armyworm infestation signs in UAV-based maize monitoring. By combining super-resolution enhancement, attention-aware feature extraction, a fine-grained detection head, and bidirectional multi-scale fusion, the model delivers superior accuracy and robustness. This approach offers a practical and scalable solution for early pest outbreak detection, enabling timely intervention and reducing crop losses in large-scale maize production systems.