Abstract:
Precision viticulture relies heavily on berry phenotypic data, particularly for early grape yield estimation and fruit thinning. Object detection models for this task are often deployed in unstructured agricultural environments, where they face two challenges: maintaining high detection precision against complex backgrounds, and running within the computational limits of embedded edge devices. A practical model must therefore reconcile the competing requirements of high detection precision, low parameter count, and real-time inference speed. In this study, an improved lightweight architecture based on YOLO11n (BD-YOLO) was proposed to detect grape berries during the thinning period. Three strategies were applied to the baseline model to build BD-YOLO. 1) To reduce model redundancy, the original backbone was replaced with RepViT, a lightweight convolutional neural network designed from the perspective of the Vision Transformer (ViT) architecture. Depthwise convolutions (DWConv) and structural reparameterization were used to reduce the parameter count and floating-point operations (FLOPs) of the backbone while preserving fundamental feature extraction. 2) To offset the precision degradation that aggressive lightweighting can cause, a dual-layer high-resolution structure was designed in the detection head, allowing the module to extract subtle details from the background and increasing its sensitivity to small-scale grape berry targets. 3) A Focaler-MPDIoU loss function was introduced to account for the varying difficulty of learning samples caused by mutual occlusion and for the specific morphology of grape berry targets, which share similar aspect ratios but differ in physical size. It combines the minimum point distance (MPD) with Focaler-intersection over union (Focaler-IoU) to reduce bounding box regression error. Extensive experiments verified the advantages of the improved model.
On the custom dataset, the parameter count, model size, and floating-point operations of BD-YOLO were reduced by 89.2%, 81.8%, and 9.5%, respectively, compared with the baseline YOLO11n. At the same time, the mean average precision (mAP) and recall improved by 3.5 and 4.3 percentage points, respectively, with mAP reaching 91.8%, while the detection speed increased from 114.2 to 142.1 frames per second (FPS). In generalization tests on public datasets, BD-YOLO attained an mAP of 89.5%, showing no significant degradation relative to the custom dataset (91.8%) and indicating strong robustness. Deployment tests on actual edge devices showed that BD-YOLO achieved a real-time inference speed of 35.6 FPS, 26.7% faster than the baseline model. Computational cost was thus reduced without sacrificing accuracy. These findings provide a promising reference for the precise detection of grape berries during the thinning period, and the robustness and efficiency of BD-YOLO make it a candidate for low-cost, large-scale deployment in smart orchards.
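
The abstract does not give the exact form of the Focaler-MPDIoU loss, so the following is only a minimal sketch of how such a combination is commonly written. It assumes the published MPDIoU definition (IoU minus the squared distances between matching corners, normalized by the squared image diagonal) and the published Focaler-IoU linear remapping of IoU onto an interval [d, u]. The function names, the corner box format (x1, y1, x2, y2), and the thresholds d = 0.0 and u = 0.95 are illustrative assumptions, not values taken from this study.

```python
def iou(box_a, box_b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mpdiou(pred, target, img_w, img_h):
    """IoU penalized by squared corner distances (minimum point distance)."""
    d_top_left = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    d_bottom_right = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2
    norm = img_w ** 2 + img_h ** 2  # squared diagonal of the input image
    return iou(pred, target) - d_top_left / norm - d_bottom_right / norm


def focaler_iou(iou_value, d=0.0, u=0.95):
    """Linearly remap IoU from [d, u] to [0, 1], clamping outside the interval.

    d and u are assumed thresholds that decide which samples the loss focuses on.
    """
    return min(1.0, max(0.0, (iou_value - d) / (u - d)))


def focaler_mpdiou_loss(pred, target, img_w, img_h, d=0.0, u=0.95):
    """Assumed combination: L_MPDIoU + IoU - IoU_focaler."""
    plain_iou = iou(pred, target)
    mpdiou_loss = 1.0 - mpdiou(pred, target, img_w, img_h)
    return mpdiou_loss + plain_iou - focaler_iou(plain_iou, d, u)


if __name__ == "__main__":
    # Two equally sized boxes offset horizontally, in a hypothetical 100x100 image.
    print(focaler_mpdiou_loss((0, 0, 10, 10), (5, 0, 15, 10), 100, 100))
```

Under these assumptions, a perfectly matched box yields zero loss. The corner-distance term still separates predictions whose overlap with the target is equal but whose position or size differs, which may matter for berries that share an aspect ratio but vary in size.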