Abstract
High detection accuracy and low computational complexity are both required for tea pest and disease detection in modern agriculture. In this study, a lightweight, high-precision model named WGSE-YOLOv11n was proposed based on an improved YOLOv11n architecture. A tea pest and disease dataset was also constructed to support model training and validation. The dataset comprised 9 categories (healthy leaves, tea anthracnose, tea leaf spot disease, tea black rot, tea leaf rust, tea leaf blight, tea white spot disease, tea aphids, and tea spider mites), totaling 2,496 sample images drawn from two sources: 746 valid images captured at the Yunfeng Ecological Tea Garden in Hanzhong, Shaanxi Province, China, in 2025, and 1,750 images sourced from the Roboflow public dataset. All images were uniformly resized to a resolution of 640×640. Label Studio was used to annotate the disease regions in the images, and the annotations were then converted into YOLO format. Mosaic data augmentation, combined with translation, rotation, scaling, brightness adjustment, and noise injection, was applied to expand the sample size to 6,150 images. The dataset was ultimately divided into training (4,296 images), validation (1,235 images), and test (619 images) sets at a ratio of 7:2:1. In the design of the WGSE-YOLOv11n model, a Wavelet-Gaussian Dynamic Convolution (WGD) architecture was first developed in the backbone network, integrating the frequency-domain decomposition of the WaveletPool module with the lightweight properties of the C3k2-GhostDynamic module; this enhanced feature representation while reducing parameter complexity, thereby improving the capture of multi-scale lesion features. Secondly, a Wavelet Pooling Star Fusion (WSF) architecture was incorporated into the neck network, combining WaveletUnPool upsampling with a StarFusion feature enhancement module to effectively reconstruct lesion structure and texture information, thereby improving recognition accuracy for minute lesions and complex boundaries. Finally, a lightweight EfficientHead detection module was introduced, in which channel grouping via the GroupConv architecture further reduced the parameters and computational complexity. Grad-CAM was employed to generate heatmaps for tea leaf pest and disease detection in order to validate detection performance. The results show that the WGSE-YOLOv11n model accurately and rapidly located lesions; its heatmaps exhibited high-response zones along pathological tissue boundaries, indicating strong spatial coupling with lesion morphology. In contrast, the heatmaps of the YOLOv11n baseline model showed significant spatial diffusion, with weak responses in diseased areas, edge attenuation, and false activation of non-pathological tissues. In the WGSE-YOLOv11n model, edge response intensity was significantly enhanced, false-positive activations were substantially reduced, and no feature confusion occurred in multi-leaf scenarios. The improved model also outperformed the baseline in response intensity toward minute targets, such as the ring lesions of tea cake disease and tea aphids, showing clear improvements in pathological feature focus, edge precision, and multi-class robustness.
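As a minimal illustration of the augmentation and split described above (not the authors' actual pipeline), the sketch below applies the named per-image operations with albumentations and performs an approximate 7:2:1 split; Mosaic itself is typically handled inside the YOLO training loop, and all file names and parameter values are hypothetical.

```python
import random
import albumentations as A  # assumed library for the per-image transforms

# Per-image augmentations named in the abstract: translation/rotation/scaling,
# brightness adjustment, and noise injection (limits are illustrative only).
augment = A.Compose([
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=30, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),
    A.GaussNoise(p=0.3),
])

# Approximate 7:2:1 split of the 6,150 augmented images (the paper's exact
# counts, 4,296/1,235/619, differ slightly from a strict 70/20/10 partition).
images = [f"img_{i:04d}.jpg" for i in range(6150)]  # hypothetical file names
random.seed(0)
random.shuffle(images)
n_train, n_val = int(0.7 * len(images)), int(0.2 * len(images))
train, val, test = images[:n_train], images[n_train:n_train + n_val], images[n_train + n_val:]
```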
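The Grad-CAM heatmaps referenced above follow the standard gradient-weighted class activation mapping procedure; a minimal PyTorch sketch is given below on a stand-in CNN (resnet18, since the WGSE-YOLOv11n internals are not reproduced here), with the target layer and input chosen purely for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal Grad-CAM sketch on a stand-in CNN; the paper applies the same idea
# to the feature maps of WGSE-YOLOv11n, which are not reproduced here.
model = models.resnet18(weights=None).eval()
target_layer = model.layer4[-1].conv2  # illustrative choice of a late conv layer

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0].detach()))

x = torch.randn(1, 3, 640, 640)            # 640x640 input, matching the dataset resolution
score = model(x)[0].max()                  # top-class score (stand-in for a detection score)
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)        # channel weights from pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))            # weighted activation sum + ReLU
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heatmap in [0, 1]
```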
The improved model was also deployed on a Jetson Orin NX development board connected to a D455 camera for real-time image capture, so that its practical performance on mobile devices could be further validated. For deployment, TensorRT was integrated for operator acceleration and INT8 quantization, and CUDA was utilized for multithreaded parallel preprocessing; these optimizations improved computational efficiency and detection speed, mitigating the computational constraints of embedded platforms. Experimental results demonstrate that the WGSE-YOLOv11n model achieved 97.64% precision, 97.87% recall, and 99.08% mean average precision on the self-built dataset, improvements of 0.24%, 3.62%, and 0.77%, respectively, over the baseline YOLOv11n model. Parameter count, computational load, and model size were compressed to 1.51 million parameters, 3.3 GFLOPs, and 3.2 MB, respectively, reductions of 41.5%, 47.6%, and 39.6% compared with the baseline model. On the Jetson Orin NX, the detection frame rate reached 246.14 frames per second, with an average inference time of 14 ms per image. The model thus achieved high recognition accuracy while substantially reducing computational load, making it suitable for deployment on mobile and embedded devices and for real-time detection in field environments.
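For context on the deployment figures, one possible measurement route (not the authors' exact scripts) is sketched below using the ultralytics Python API, which wraps TensorRT engine export with INT8 calibration; the weight, dataset, and image file names are hypothetical.

```python
import time
from ultralytics import YOLO  # assumes the ultralytics package is installed

# Export the trained detector to a TensorRT engine with INT8 quantization
# (INT8 export needs a calibration dataset, passed via the `data` argument).
model = YOLO("wgse_yolov11n.pt")                                       # hypothetical weight file
model.export(format="engine", int8=True, imgsz=640, data="tea.yaml")   # hypothetical dataset yaml

# Benchmark average per-image latency and throughput on the target device.
engine = YOLO("wgse_yolov11n.engine")
n, t0 = 200, time.perf_counter()
for _ in range(n):
    engine.predict("tea_leaf_sample.jpg", imgsz=640, verbose=False)    # hypothetical test image
elapsed = time.perf_counter() - t0
print(f"avg latency: {1000 * elapsed / n:.2f} ms, throughput: {n / elapsed:.2f} FPS")
```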