Abstract:
Grapes are among the fruits with the widest cultivation area, the highest yield, and very high economic value in China. Bagging is often used to reduce the impact of pests and diseases on grape quality during the harvest period. Accurate yield estimation helps growers plan picking, sales, and storage, and thus reduces the economic losses caused by supply-demand mismatches. Yield estimation in turn requires an accurate count of bagged grapes. However, existing fruit counting methods often fall short of real-time detection and suffer from tracking failures, owing to occlusion of the bagged grapes and unprocessed detection noise. In this study, a video counting method for bagged grapes was proposed, based on an improved YOLOv9s and an adaptive Kalman filter. The method comprises three modules: an improved YOLOv9s detection model, an adaptive Kalman filter tracking algorithm, and a line-drawing counting method. In detection, the original RepNCSPELAN4 module in YOLOv9s was replaced with an efficient feature enhancement module (EFEM) to reduce the number of model parameters and increase inference speed, so that the improved model meets real-time detection requirements. The EFEM was designed to selectively learn from a subset of the feature maps of bagged grapes, enabling efficient feature extraction and faster inference. The FasterNet module was used to extract spatial features efficiently while minimizing redundant computation and memory access. A spatially enhanced attention module (SEAM) was introduced to further improve detection performance under occlusion. The SEAM learns the relationship between occluded and unoccluded areas, then predicts and compensates for the occluded features, improving the detection accuracy of bagged grapes under full and partial occlusion.
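The saving from learning on only part of the feature maps can be illustrated with a rough operation count. The sketch below assumes a FasterNet-style partial convolution in which a k×k convolution is applied to a fraction of the channels and the remaining channels pass through unchanged. The function names, the feature-map size, and the 1/4 channel ratio are illustrative assumptions, not values reported in this abstract.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution (stride 1, same padding)."""
    return h * w * k * k * c_in * c_out


def partial_conv_macs(h, w, k, channels, ratio):
    """Convolve only the first `channels * ratio` channels; the rest are left untouched."""
    c_p = int(channels * ratio)  # number of channels that are actually convolved
    return conv_macs(h, w, k, c_p, c_p)


# Hypothetical 80 x 80 feature map with 64 channels and a 3 x 3 kernel.
full = conv_macs(80, 80, 3, 64, 64)
partial = partial_conv_macs(80, 80, 3, 64, 0.25)  # assumed ratio of 1/4
saving = full / partial  # cost ratio between the dense and the partial convolution
```

Because the cost grows with the square of the number of convolved channels, processing a quarter of the channels needs one sixteenth of the multiply-accumulates in this toy setting.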
In tracking, an adaptive Kalman filter algorithm was proposed to suppress the detection noise caused by camera shake and rapid movement. The measurement noise estimate was adjusted automatically according to the detection confidence, which improved the accuracy of the Kalman filter's trajectory prediction. A line-drawing counting method was used for the real-time counting of bagged grapes: once the center of a bagged grape crossed a virtual counting line, the count increased by one. The experimental dataset was collected at the PaiDengTe Technology Demonstration Park in Bishan District, Chongqing, China, and comprised 700 original images of bagged grapes and six video clips. The images were randomly divided into a training set of 490 images, a validation set of 140 images, and a test set of 70 images, at a ratio of 7:2:1. The six video clips were used to test the counting performance. Image augmentation, including saturation adjustment, brightness variation, image mirroring, and Gaussian noise addition, was applied to the training set, expanding it to 2100 images and improving the robustness and generalization of the detection model. Experimental results show that the improved YOLOv9s model (ES-YOLOv9s) outperformed five other models, achieving the highest mean average precision (96.9%) and recall (93.1%) at an inference speed of 70 frames per second. Compared with the original YOLOv9s, ES-YOLOv9s reduced the number of parameters by 29.6% and the number of floating-point operations by 10.9 G, while the frame rate increased by 20 frames per second.
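The idea of tying the noise estimate to detection confidence can be sketched with a scalar Kalman measurement update. This is a minimal illustration, not the algorithm of this study: the linear scaling rule, the function name `adaptive_update`, and the parameters `base_r` and `eps` are assumptions made here for clarity.

```python
def adaptive_update(x, p, z, confidence, base_r=4.0, eps=1e-3):
    """One scalar Kalman measurement update with confidence-scaled noise.

    x, p: predicted state (e.g. a box-center coordinate in pixels) and its variance
    z: detected coordinate; confidence: detector score in [0, 1]
    """
    # Assumed rule: the lower the confidence, the larger the measurement variance.
    r = base_r * (1.0 - confidence) + eps
    k = p / (p + r)          # Kalman gain
    x_new = x + k * (z - x)  # corrected state
    p_new = (1.0 - k) * p    # corrected variance
    return x_new, p_new, k


# Same prediction and detection, different detector confidence.
x_hi, _, k_hi = adaptive_update(100.0, 2.0, 110.0, confidence=0.95)
x_lo, _, k_lo = adaptive_update(100.0, 2.0, 110.0, confidence=0.30)
```

With a confident detection the gain is larger and the state moves closer to the measurement; with a low-confidence detection the filter leans more on its own prediction, which is the behavior wanted when camera shake produces unreliable boxes.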
In terms of tracking performance, the adaptive Kalman filter tracking algorithm achieved 58.6%, 63.6%, and 78.8% in higher-order tracking accuracy (HOTA), multi-object tracking accuracy (MOTA), and ID F1 score (IDF1, the harmonic mean of ID precision and ID recall), respectively, which are improvements of 4.3, 2.2, and 2.5 percentage points over ByteTrack. In terms of counting performance, line-drawing counting achieved an average accuracy of 80.0% relative to manual counting. In conclusion, the proposed video counting method, which combines the improved YOLOv9s with an adaptive Kalman filter, shows good application potential for the real-time tracking and counting of bagged grapes. The findings can provide technical support for the pre-harvest yield estimation of bagged grapes.
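The line-drawing counting rule can be sketched as follows. The sketch assumes a vertical counting line at `x = line_x` and per-frame tracker output given as a mapping from track ID to box center. Both the line orientation and the data layout are illustrative assumptions, as is the function name `count_crossings`.

```python
def count_crossings(tracks_per_frame, line_x):
    """Count track IDs whose center crosses the vertical line x = line_x.

    tracks_per_frame: list of dicts mapping track ID -> (cx, cy), one dict per frame.
    Each track ID is counted at most once.
    """
    last_x = {}
    counted = set()
    for frame in tracks_per_frame:
        for track_id, (cx, _cy) in frame.items():
            prev = last_x.get(track_id)
            if prev is not None and track_id not in counted:
                # A sign change of (x - line_x) between two frames means the
                # center touched or passed the counting line.
                if (prev - line_x) * (cx - line_x) <= 0 and prev != cx:
                    counted.add(track_id)
            last_x[track_id] = cx
    return len(counted)


# Hypothetical tracker output for four frames.
frames = [
    {1: (90, 50), 2: (40, 80)},
    {1: (98, 50), 2: (45, 80)},
    {1: (104, 51), 2: (50, 80), 3: (120, 30)},
    {1: (110, 51), 3: (126, 30)},
]
total = count_crossings(frames, line_x=100)
```

Only track 1 moves from one side of the line to the other, so it is the only track counted. Keeping a set of already counted IDs prevents a bag that jitters around the line from being counted twice, although an ID switch in the tracker would still cause a miscount.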