Abstract:
Loquat flower buds grow in dense clusters. Timely, precise thinning at the early bud stage is needed to balance nutrient allocation, prevent uneven fruit size, and ensure commercial fruit quality. However, accurately identifying and localizing the buds remains challenging in practical orchard scenarios, because the buds are small, grow in clusters, and closely resemble the surrounding foliage in color. To address these issues and support automated orchard operations, this study proposes a target detection algorithm, termed PSMF-YOLO, that combines unmanned aerial vehicle (UAV) remote sensing imagery with deep learning. A loquat flower bud dataset was first constructed from high-resolution images acquired by UAVs in real orchard environments. An improved detection model was then developed on this dataset, using YOLOv11n as the baseline to balance computational efficiency and detection accuracy. Several modifications were made to adapt the network to these small targets. First, a P2 small-object detection layer was introduced. This layer draws on shallower network layers to preserve high-resolution spatial detail, improving the model's ability to detect small, densely packed targets. Second, a Space-to-Depth Convolution (SPD-Conv) module was incorporated into the feature extraction network.
By retaining fine-grained information during the space-to-depth transformation and avoiding the information loss typically caused by strided convolutions, this module improves detection accuracy, particularly for low-resolution images and small objects. Third, the C2PSA module was combined with a multi-scale dilated attention (MSDA) mechanism. This combination provides cross-level perception through multi-scale dilated convolutions and dynamic feature fusion, strengthening the network's global context modeling. Finally, a Focaler-DIoU bounding box regression loss was employed. This loss alleviates the sample imbalance common in small-object detection and improves bounding box localization, making the model more robust under complex orchard conditions. Experimental results show that PSMF-YOLO improves mean average precision (mAP@0.5), precision (P), and recall (R) by 4.9, 2.3, and 4.2 percentage points, respectively, over the baseline YOLOv11n model. Compared with mainstream object detection frameworks, PSMF-YOLO outperforms the best-performing alternative, YOLOv5s, by 0.8, 0.1, and 0.6 percentage points on the same three metrics. These results indicate that PSMF-YOLO is an effective and accurate solution for UAV-based small-target detection in agricultural remote sensing, and the study offers technical support for intelligent, precise orchard management.
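
The space-to-depth rearrangement that the SPD-Conv module relies on can be illustrated with a minimal sketch. It is not the PSMF-YOLO implementation: the function names `space_to_depth` and `depth_to_space`, the single-channel nested-list input, and the scale factor of 2 are assumptions made for illustration, and the non-strided convolution that follows the rearrangement in SPD-Conv is omitted. The sketch shows that every value of the input survives the downsampling, whereas plain stride-2 subsampling keeps only one of the four sub-grids.

```python
def space_to_depth(feature_map, scale=2):
    """Split an H x W grid into scale*scale sub-grids of size (H/scale) x (W/scale).

    Each sub-grid would become one output channel, so spatial resolution
    drops by `scale` while every input value is kept.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    channels = []
    for dy in range(scale):          # row offset inside each scale x scale window
        for dx in range(scale):      # column offset inside each window
            sub_grid = [row[dx::scale] for row in feature_map[dy::scale]]
            channels.append(sub_grid)
    return channels


def depth_to_space(channels, scale=2):
    """Inverse rearrangement, used here only to show that nothing was lost."""
    sub_h = len(channels[0])
    sub_w = len(channels[0][0])
    restored = [[None] * (sub_w * scale) for _ in range(sub_h * scale)]
    for index, sub_grid in enumerate(channels):
        dy, dx = divmod(index, scale)
        for y in range(sub_h):
            for x in range(sub_w):
                restored[y * scale + dy][x * scale + dx] = sub_grid[y][x]
    return restored


# Toy 4 x 4 "feature map" holding the values 0..15.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
packed = space_to_depth(grid)

# Plain stride-2 subsampling keeps only the first sub-grid.
subsampled = [row[0::2] for row in grid[0::2]]

print(len(packed), "channels of size", len(packed[0]), "x", len(packed[0][0]))
print("lossless:", depth_to_space(packed) == grid)
```

In a network the same rearrangement would be applied to each channel of a feature tensor, multiplying the channel count by the square of the scale factor before the subsequent convolution.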