Abstract:
Accurate and rapid detection of mango fruits and peduncles is often required in natural environments. However, current detectors are hampered by similarity between target and background colors, foliage occlusion, and overlapping fruits. In this study, an improved model, MAL-YOLOv10n, was proposed to detect mango fruits and peduncles on the basis of the YOLOv10n framework, with particular attention to the detection accuracy of small peduncle targets. Several modifications were made to the architecture and key modules, yielding significant improvements in model performance. Firstly, the original C2f module in the backbone network was reengineered into an RFAM-C2f module. In the conventional C2f module, the Bottleneck relies mainly on standard 3×3 convolutions for feature extraction, which often fail to capture global contextual information in complex scenes with similar background hues, occlusion, or overlapping fruits. To overcome this limitation, the 3×3 convolution in the Bottleneck was replaced with receptive-field attention convolution (RFAConv), an attention-based convolution that expands the receptive field and captures global contextual cues, making the initial feature extraction stage more robust. In addition, a Convolutional Block Attention Module (CBAM) was appended after the modified Bottleneck to further refine feature selection; by applying channel and spatial attention sequentially, CBAM automatically focuses on target regions while suppressing background noise and interference. The resulting RFAM-C2f module accurately extracts the effective features of mango fruits and peduncles. Secondly, a bidirectional feature pyramid network (BiFPN) was introduced into the feature fusion network to improve the detection accuracy of small peduncle targets. Conventional unidirectional feature pyramids suffer from insufficient information flow during multi-scale feature fusion; BiFPN instead uses bidirectional transmission with learnable weighting coefficients to adaptively integrate features across scales, so that low-level detail is fused with high-level semantic information and the feature loss caused by small target size is effectively mitigated. Experimental results demonstrated that incorporating BiFPN significantly improved both recall and precision for small peduncle detection. Finally, partial convolution (PConv) was introduced into the neck network to form a lightweight PConv-C2f module; by convolving only part of the channels, PConv reduces unnecessary computation and memory access while maintaining effective feature extraction, thereby lowering the computational complexity and parameter count (minimal code sketches of the three modules follow the abstract). MAL-YOLOv10n significantly outperformed the original YOLOv10n model across multiple metrics: the improved model achieved a precision of 94.9%, a recall of 89.7%, and a mean average precision (mAP) of 95.5%, improvements of 3.1%, 3.3%, and 2.5% over YOLOv10n, respectively, at a detection speed of 119.6 frames per second. In terms of lightweight design, MAL-YOLOv10n reduced the floating-point operations, parameter count, and model size by 12%, 3.7%, and 8.6%, respectively.
Furthermore, MAL-YOLOv10n achieved superior performance in complex scenarios and small-target detection compared with mainstream object detection models, including Faster R-CNN, SSD, YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv8s, YOLOv10n, YOLOv10s, YOLO11s, YOLOv12n, and RT-DETR. In summary, the proposed model strikes an optimal balance between detection speed and accuracy while remaining exceptionally robust under complex environmental conditions. These findings can provide valuable technical support for mango harvesting in challenging natural scenes.
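The abstract describes RFAM-C2f only at a high level. The following is a minimal PyTorch sketch of the attention refinement it names, using the standard CBAM formulation (channel attention followed by spatial attention); the RFAConv that replaces the Bottleneck's 3×3 convolution is stood in for by a plain 3×3 convolution, since the paper's exact RFAConv configuration is not given here, and the class names (e.g. `RFABottleneck`) are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: a shared 1x1-conv MLP over global avg- and max-pooled features."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)          # (B, C, 1, 1) channel weights

class SpatialAttention(nn.Module):
    """Spatial attention: a 7x7 conv over channel-wise avg and max maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)

class CBAM(nn.Module):
    """CBAM: channel attention applied first, then spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

class RFABottleneck(nn.Module):
    """Illustrative Bottleneck per the abstract: a 3x3 conv as a hypothetical
    stand-in for RFAConv, followed by CBAM refinement and a residual add."""
    def __init__(self, channels: int):
        super().__init__()
        self.cv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.rfa = nn.Conv2d(channels, channels, 3, padding=1, bias=False)  # placeholder for RFAConv
        self.cbam = CBAM(channels)

    def forward(self, x):
        return x + self.cbam(self.rfa(self.cv1(x)))
```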
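The "learnable weighted coefficients" used in BiFPN fusion correspond to the fast normalized fusion introduced with EfficientDet: each input feature map gets a non-negative learnable weight, and the weighted sum is normalized by the weight total. A minimal sketch follows, assuming the inputs have already been resized to a common resolution; the module name `WeightedFusion` is illustrative.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of n same-shape feature maps:
    out = sum_i(w_i * x_i) / (eps + sum_j w_j), with w_i kept non-negative
    via ReLU so each learnable weight acts as a soft importance score."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, xs):
        w = torch.relu(self.weights)        # enforce non-negative weights
        w = w / (w.sum() + self.eps)        # normalize without a softmax
        return sum(wi * xi for wi, xi in zip(w, xs))

# Usage: fuse an upsampled high-level map with a low-level map.
fuse = WeightedFusion(num_inputs=2)
p_low = torch.randn(1, 64, 80, 80)          # low-level, detail-rich features
p_high = torch.randn(1, 64, 40, 40)         # high-level, semantic features
p_high_up = nn.functional.interpolate(p_high, scale_factor=2, mode="nearest")
out = fuse([p_low, p_high_up])              # shape: (1, 64, 80, 80)
```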
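Partial convolution (PConv), as defined in FasterNet, convolves only a fraction of the input channels and forwards the rest unchanged, which is the source of the computation and memory-access savings claimed for PConv-C2f. A minimal sketch follows; the 1/4 channel ratio is the common FasterNet default, not a figure taken from this abstract.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution (FasterNet-style): convolve only the first
    c_p = channels // divisor channels; the remaining channels pass
    through unchanged, cutting FLOPs and memory accesses."""
    def __init__(self, channels: int, divisor: int = 4, kernel_size: int = 3):
        super().__init__()
        self.c_p = channels // divisor
        self.conv = nn.Conv2d(self.c_p, self.c_p, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

# A 64-channel map: only 16 channels are actually convolved.
y = PConv(64)(torch.randn(1, 64, 40, 40))   # shape preserved: (1, 64, 40, 40)
```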