Abstract:
Accurate identification is often required to overcome occlusion by branches and leaves in orchards. In this study, an accurate and fast model based on an improved YOLOv11n was proposed to detect citrus fruits in complex environments. An IRSC (inverted residual-attention shiftwise convolution) module was used to improve the C3k2 (cross-stage partial with kernel-size 2) module of the original backbone network. A spatial weighting mechanism with a morphological-prior Gaussian bias was adopted to enhance the feature response of key regions: attention was concentrated on the center of the target region to fully exploit the approximately round shape of citrus targets, so that the critical parts of the fruit remained salient even when occluded by branches and leaves. Meanwhile, an inverted residual design was incorporated to expand the receptive field of weak features, capturing more contextual information for small targets and weak features under low-light conditions. The IRSC module also broke through the local receptive field of small convolution kernels, enabling the network to effectively approximate the global context modelling of large-kernel convolutions. Compared with the original network structure, detection performance improved significantly, and feature extraction became better suited to low-light images. The Retinexformer (one-stage Retinex-based transformer) module was utilized for low-light enhancement: illumination was decomposed at multiple scales, enabling strong end-to-end enhancement of dark areas, so that the dark regions where citrus fruits were located could be accurately brightened and feature extraction was greatly strengthened after optimization. Under underexposure, extracting citrus features in orchards is difficult because of low image contrast, indistinct contours, and uneven illumination; this integration effectively improved the contrast and illuminance of the image while introducing less noise, and the better feature extraction of citrus fruits in turn improved detection accuracy. Furthermore, the ADown module was employed to replace some standard convolutions during downsampling. Its multi-branch parallel processing extracts features from different branches and then fuses them effectively, reducing model complexity and parameter count while maintaining high accuracy. The results show that the improved YOLOv11n model reached an mAP@0.5 of 87.1% with a recall of 79.1% for citrus detection in complex environments. Compared with the original model, the mAP@0.5 and recall increased by 1.9 and 3.0 percentage points, respectively, while the number of parameters and the model size were reduced by 8.5% and 4.0%, respectively. Ablation studies demonstrate that each module contributes significantly to the overall improvement: the Retinexformer module alone increases mAP@0.5 by 1.1 percentage points, the C3k2-IRSC module contributes an additional 0.6 percentage points, and the combination of Retinexformer and C3k2-IRSC achieves a 1.5-percentage-point improvement.
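To make the morphological-prior weighting concrete, a minimal PyTorch sketch of a centered Gaussian spatial bias applied to a feature map is given below. This is an illustrative reading of the mechanism described above, not the paper's exact implementation: the module name `GaussianSpatialBias`, the fixed image-centered prior, and the learnable strength parameter are all assumptions.

```python
import torch
import torch.nn as nn

class GaussianSpatialBias(nn.Module):
    """Illustrative spatial weighting with a centered Gaussian prior.

    Assumption: the morphological prior is modeled as a 2-D Gaussian
    centered on the feature map, reflecting the roughly round shape of
    citrus targets; a learnable scale lets the network temper the bias.
    """

    def __init__(self, sigma: float = 0.5):
        super().__init__()
        self.sigma = sigma
        self.scale = nn.Parameter(torch.ones(1))  # learnable strength of the prior

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        # Normalized coordinates in [-1, 1] for every spatial position.
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        # Centered Gaussian: responses near the region center are emphasized.
        gauss = torch.exp(-(xx ** 2 + yy ** 2) / (2.0 * self.sigma ** 2))
        weight = 1.0 + self.scale * gauss  # soft bias, not a hard mask, so edges survive
        return x * weight.view(1, 1, h, w)

bias = GaussianSpatialBias(sigma=0.5)
feats = torch.randn(2, 64, 40, 40)
out = bias(feats)  # same shape; responses near the center are amplified
```

In the actual network, such a bias would sit inside the attention path of the C3k2-IRSC block and be applied to candidate target regions rather than to the whole feature map; the standalone form is shown only to illustrate how a round-shape prior re-weights feature responses.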
The ADown module reduces parameters by 17.4% when used alone, while the final model with all three modules achieves an 8.5% parameter reduction. An effective collaborative mechanism was formed: Retinexformer provided standardized input through image enhancement, IRSC extracted global contextual features from the enhanced input, and ADown balanced feature preservation and compression while controlling parameter growth. The proposed model achieves the best balance between accuracy and efficiency, with the collaborative mechanism yielding gains that no single module achieves alone. This collaborative design enables the detector to cope with underexposure, heavy occlusion, and complex backgrounds in orchards within a unified framework. In comparison with mainstream object detection models, including Faster R-CNN, SSD, EfficientDet, YOLOv5n, YOLOv8n, YOLOv10n, YOLOv12n, and RT-DETR-l, the proposed model attains an mAP@0.5 higher by 1.8 to 25.6 percentage points, with parameter reductions ranging from 5.6% to 92.6%, indicating clear advantages in both detection accuracy and model compactness. High accuracy was thus effectively achieved for citrus detection in complex environments, and the findings can provide a valuable technical reference for citrus picking in a more efficient and intelligent fruit industry.
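For reference, the sketch below follows the ADown block as introduced in YOLOv9, which matches the multi-branch downsampling described above: the input is lightly average-pooled, split channel-wise, and the two halves are downsampled by a strided 3x3 convolution and by a max-pool followed by a 1x1 convolution before being concatenated. The helper `conv_bn_act` and the SiLU activation are standard YOLO-family conventions assumed here; the paper's exact hyperparameters may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(c_in: int, c_out: int, k: int, s: int, p: int) -> nn.Sequential:
    """Conv -> BatchNorm -> SiLU, the standard block in YOLO-family models."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, p, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class ADown(nn.Module):
    """Multi-branch 2x downsampling in the style of YOLOv9's ADown block."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half = c_out // 2
        self.cv1 = conv_bn_act(c_in // 2, half, k=3, s=2, p=1)  # strided-conv branch
        self.cv2 = conv_bn_act(c_in // 2, half, k=1, s=1, p=0)  # pooled branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Light smoothing before the split helps preserve fine detail.
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0)
        x1, x2 = x.chunk(2, dim=1)                # split channels (c_in assumed even)
        x1 = self.cv1(x1)                         # branch 1: learnable strided conv
        x2 = self.cv2(F.max_pool2d(x2, 3, 2, 1))  # branch 2: max-pool then 1x1 conv
        return torch.cat((x1, x2), dim=1)         # fuse branches -> c_out channels
```

Replacing a plain stride-2 convolution with this split-and-fuse form halves the channels each learnable branch must process, which is where the reported parameter savings come from.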