Abstract:
Accurate planning of harvesting priority is often required for large-scale facility-grown tomatoes, whose fruits ripen asynchronously. This asynchronous ripening poses a great challenge to harvest scheduling and fruit quality. In this study, a harvesting decision-making framework (DeepSeek-VKQ) based on DeepSeek-7B was proposed for the asynchronously ripened fruits of facility-grown tomato. The framework synergistically combined enhanced visual perception, structured knowledge retrieval, and logical reasoning. Firstly, a comprehensive corpus was constructed covering the basic knowledge of tomato cultivation, ripening stages, and pest/disease strategies. Multi-source data were integrated from technical manuals, academic literature, web resources, and expert experience, and 7,000 structured Q&A pairs were developed for knowledge reasoning. Secondly, architectural refinements were made to the YOLOv11n backbone network to enhance visual feature extraction: a global attention mechanism (GAM) effectively suppressed foliar background interference, and the conventional CIoU loss was replaced with the spatial SIoU loss to significantly improve bounding box regression accuracy under occlusion. Thirdly, a structured transformation involved the spatial discretization of fruit coordinates and the fusion of detection confidence with maturity levels into a continuous ripeness index ranging from 0 to 1.0. A non-linear weight function was used to modulate the probability flow between adjacent maturity levels, so that discrete detection outputs were transformed into the continuous index, with confidence scores integrated with maturity labels to reflect gradual ripening. An effective mapping was thus realized from detection outputs to decision semantics. Knowledge reasoning relied mainly on a dynamic knowledge base: multi-source textual knowledge was mapped into a low-dimensional semantic space, where feature vectors were generated using the BGE-M3 semantic embedding model and stored in a vector database. Efficient retrieval then linked the vector indexes back to the original knowledge entries. Key environmental parameters were acquired in real time through the API interfaces of meteorological platforms, providing precise semantic matching for chain-of-thought (CoT) decomposition-guided reasoning. Ultimately, the large language model (LLM) was deeply integrated with dynamically updated agronomic-meteorological knowledge. The framework was validated through experiments on annotated tomato images. The visual extractor achieved a mean average precision (mAP) of 87.6% at IoU thresholds of 0.5-0.95, exceeding YOLOv12n, YOLOv13n, and RT-DETRv2 by 2.5, 3.2, and 2.9 percentage points, respectively, with an inference time of 10.2 ms. The framework also performed well in tomato-harvesting decision-making tasks, achieving a precision, recall, and F1-score of 88.4%, 91.7%, and 90.0%, respectively. Compared with the original DeepSeek-7B model, these metrics were improved by 21.0, 18.0, and 19.6 percentage points, respectively. Ablation experiments showed F1-score contributions of 7.8 percentage points from the vision module, 6.6 percentage points from knowledge retrieval, and 3.6 percentage points from CoT decomposition.
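A minimal sketch of the detection-to-index fusion described above is given below. The exact formula is not specified in this abstract, so the five ordered maturity classes, the anchor values, and the weighting exponent gamma are all hypothetical; the sketch only illustrates how a non-linear confidence weight could blend adjacent maturity levels into a continuous index.

```python
import numpy as np

# Hypothetical sketch: fuse a discrete maturity label and its detection
# confidence into a continuous ripeness index in [0, 1].
# Five ordered classes (green ... fully red) and their anchors are assumed.
MATURITY_ANCHORS = np.linspace(0.0, 1.0, 5)

def ripeness_index(cls_id: int, confidence: float, gamma: float = 2.0) -> float:
    """Map (maturity class, confidence) to a continuous index in [0, 1].

    A non-linear weight w = confidence**gamma keeps the index at the class
    anchor when confidence is high, and lets it drift toward the next
    maturity level when confidence is low, approximating the gradual
    transition between adjacent ripening stages.
    """
    anchor = MATURITY_ANCHORS[cls_id]
    next_anchor = MATURITY_ANCHORS[min(cls_id + 1, len(MATURITY_ANCHORS) - 1)]
    w = confidence ** gamma                 # non-linear modulation weight
    return float(w * anchor + (1.0 - w) * next_anchor)

# Example: a "turning" fruit (class 2) detected with 0.8 confidence
print(round(ripeness_index(2, 0.8), 3))     # -> 0.59
```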
These module-level contributions jointly accounted for the overall performance. Compared with the benchmarks, the 7B-parameter DeepSeek-VKQ substantially outperformed several larger open-source multimodal models, exceeding GLM-4V-9B, InternLM3 (20B), Qwen2.5-VL (72B), and DeepSeek-VL2 (27.5B) by 16.1, 17.2, 10.8, and 12.6 percentage points in F1-score, respectively. Notably, its performance approached that of leading closed-source multimodal models: the F1-score of 90.0% trailed GPT-4o's 90.9% by a marginal 0.9 percentage points, the recall of 91.7% surpassed GPT-4o's 90.0%, and the precision of 88.4% was close behind GPT-4o's 91.8%. Importantly, all of this was achieved with a fraction of the parameter scale. Task-specific evaluations showed that hallucination rates remained below 6.5% across all subtasks. Furthermore, a third-party large language model was used to evaluate the framework's performance on diverse tasks, thereby enhancing its reliability in practical applications. Cross-modal perception, knowledge retrieval, and logical reasoning were integrated in the framework to enable high-precision tomato harvest decision-making. These findings can provide effective technical support for robotic harvesting decision-making in facility-grown tomato cultivation.
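As an illustration of the knowledge-retrieval step described above (BGE-M3 embeddings stored in a vector database and linked back to original knowledge), the following minimal sketch assumes the FlagEmbedding package for BGE-M3 and an in-memory FAISS index; the document snippets, index type, and function names are hypothetical, since the framework's actual vector database is not specified here.

```python
import faiss
import numpy as np
from FlagEmbedding import BGEM3FlagModel

# Hypothetical sketch of semantic retrieval over a small agronomic corpus.
model = BGEM3FlagModel("BAAI/bge-m3")          # BGE-M3 semantic embedding model
docs = [
    "Turning-stage fruit should be prioritized for near-term harvest.",
    "High humidity above 85% favors gray mold on ripe tomato fruit.",
]

# Map multi-source textual knowledge into a low-dimensional semantic space.
vecs = model.encode(docs)["dense_vecs"].astype(np.float32)
faiss.normalize_L2(vecs)
index = faiss.IndexFlatIP(vecs.shape[1])       # cosine similarity via inner product
index.add(vecs)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the top-k knowledge entries, linking vector indexes
    back to their original text."""
    q = model.encode([query])["dense_vecs"].astype(np.float32)
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [docs[i] for i in ids[0]]

print(retrieve("Which fruits should be harvested first?"))
```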