
Harvesting decision-making on asynchronously ripened fruits of facility-grown tomato using improved DeepSeek

  • Abstract: To achieve accurate automated harvesting decisions for asynchronously ripening tomatoes in facility cultivation scenarios, this study proposes a harvesting management decision method (DeepSeek-VKQ) that fuses enhanced visual extraction with a dynamic-knowledge large language model. First, a global attention mechanism (GAM) and the SIoU loss function are introduced into YOLOv11n as the visual module, alleviating the difficulty of maturity assessment caused by occlusion and background interference during fruit detection in complex cultivation scenes. Second, a structured transformation module for visual information is designed, which quantizes target positions into directional descriptions and combines detection confidence with maturity labels through a non-linear scoring function to generate a continuous ripeness index, realizing the cross-modal mapping from detection results to decision semantics (a sketch of this transformation follows below). Finally, a dynamic knowledge base containing agronomic knowledge and real-time environmental data is constructed and combined with a chain-of-thought (CoT)-guided multimodal reasoning mechanism to achieve collaborative decision-making between visual features and domain knowledge. Experimental results show that the improved visual module achieves a mean average precision (mAP0.5:0.95) of 87.6%, 3.2 percentage points higher than YOLOv13n. The decision precision, recall, and F1-score of the DeepSeek-VKQ model reach 88.4%, 91.7%, and 90.0%, respectively, improvements of 21.0, 18.0, and 19.6 percentage points over the baseline model. With a lightweight design of only 7B parameters, the model outperforms open-source models such as Qwen2.5-VL (72B) and DeepSeek-VL2 in tomato harvesting decision-making, and several metrics approach those of the closed-source multimodal model GPT-4o, achieving a good balance between model scale and decision performance. This method can provide effective technical support for the development of robotic harvesting decision-making systems for facility-grown tomatoes.
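The following minimal sketch illustrates one way such a structured transformation could be implemented; the four-level maturity scale, anchor values, logistic weighting constant, and 3×3 directional grid are illustrative assumptions for exposition, not the paper's actual scoring function or parameters.

```python
import math

# Hypothetical anchor values for a four-level maturity scale (the paper's exact
# labels and scale are not given here; these numbers are illustrative only).
MATURITY_ANCHORS = {"green": 0.10, "turning": 0.40, "pink": 0.70, "ripe": 0.95}
LEVEL_ORDER = ["green", "turning", "pink", "ripe"]


def ripeness_index(label: str, confidence: float, k: float = 8.0) -> float:
    """Fuse a discrete maturity label and its detection confidence into a
    continuous ripeness index in [0, 1].

    A logistic weight (standing in for the non-linear scoring function) keeps
    the index near the label's anchor when confidence is high and pulls it
    toward the previous (less ripe) level when the detector is uncertain,
    so the index varies smoothly instead of jumping between discrete labels.
    """
    idx = LEVEL_ORDER.index(label)
    base = MATURITY_ANCHORS[label]
    neighbour = MATURITY_ANCHORS[LEVEL_ORDER[max(idx - 1, 0)]]
    w = 1.0 / (1.0 + math.exp(-k * (confidence - 0.5)))
    return w * base + (1.0 - w) * neighbour


def bearing_description(cx: float, cy: float, img_w: int, img_h: int) -> str:
    """Quantize a detected box centre into a coarse 3x3 directional description."""
    col = ["left", "centre", "right"][min(int(3 * cx / img_w), 2)]
    row = ["upper", "middle", "lower"][min(int(3 * cy / img_h), 2)]
    return f"{row}-{col}"


# Example: a 'pink' detection at 0.82 confidence in the upper-right of a 640x480 frame.
print(round(ripeness_index("pink", 0.82), 3), bearing_description(520, 100, 640, 480))
```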

     

    Abstract: Accurate planning of harvesting priority is often required for large-scale facility-grown tomatoes with asynchronous maturity. However, asynchronous ripening poses a great challenge to harvest scheduling and fruit quality. In this study, a harvesting decision-making framework (DeepSeek-VKQ) was proposed for the asynchronously ripened fruits of facility-grown tomato using DeepSeek-7B. Enhanced visual perception, structured knowledge retrieval, and logical reasoning were synergistically combined in the framework. Firstly, a comprehensive corpus was constructed covering the basic knowledge of tomato cultivation, ripening stages, and pest/disease management strategies. Multi-source data were also integrated from technical manuals, academic literature, web resources, and expert experience. A total of 7,000 structured Q&A pairs were developed for knowledge reasoning. Architectural refinements were made to the YOLOv11n backbone network to enhance visual feature extraction. A global attention mechanism (GAM) effectively suppressed foliar background interference, while the conventional CIoU loss was replaced with the SIoU loss to significantly improve bounding box regression accuracy under occlusion. The structured transformation involved the spatial discretization of fruit coordinates and the fusion of detection confidence with maturity levels into a continuous ripeness index ranging from 0 to 1.0. A non-linear weight function was used to modulate the probability flow between adjacent maturity levels. Discrete detection outputs were thus transformed into a continuous index, in which confidence scores were integrated with maturity labels to represent gradual ripening, realizing an effective mapping from detection results to decision semantics. Knowledge reasoning relied mainly on a dynamic knowledge base. Multi-source textual knowledge was mapped into a low-dimensional semantic space, with feature vectors generated by the BGE-M3 semantic embedding model. A vector database was employed to store these vectors, and efficient retrieval then linked the vector indexes back to their original knowledge entries. Key environmental parameters were acquired in real time through the API interfaces of meteorological platforms. Precise semantic matching was thus provided for chain-of-thought (CoT) decomposition-guided reasoning. Ultimately, a deep integration was achieved between the large language model (LLM) and the dynamically updated agronomic-meteorological knowledge. The framework was validated through experiments on annotated tomato images. The visual extractor achieved 87.6% mean average precision (mAP) at 0.5-0.95 IoU thresholds, which was 2.5, 3.2, and 2.9 percentage points higher than YOLOv12n, YOLOv13n, and RT-DETRv2, respectively, with an inference time of 10.2 ms. The framework also performed exceptionally in tomato harvesting decision-making tasks, achieving a precision, recall, and F1-score of 88.4%, 91.7%, and 90.0%, respectively. Compared with the original DeepSeek-7B model, these metrics were improved by 21.0, 18.0, and 19.6 percentage points, respectively. Ablation experiments showed F1-score contributions of 7.8 percentage points from the vision module, 6.6 percentage points from the knowledge retrieval, and 3.6 percentage points from the CoT decomposition, which together accounted for the overall performance. Compared with the benchmarks, the 7B-parameter DeepSeek-VKQ substantially outperformed several larger open-source multimodal models, exceeding GLM-4V-9B, InternLM3 (20B), Qwen2.5-VL (72B), and DeepSeek-VL2 (27.5B) by 16.1, 17.2, 10.8, and 12.6 percentage points in F1-score, respectively. Notably, its performance approached that of leading closed-source multimodal models: its F1-score of 90.0% trailed GPT-4o's 90.9% by a marginal 0.9 percentage points, its recall of 91.7% surpassed GPT-4o's 90.0%, and its precision of 88.4% showed only a narrow gap with GPT-4o's 91.8%. Importantly, all of these results were achieved with a fraction of the parameter scale. Task-specific evaluations showed that hallucination rates remained below 6.5% across all subtasks. Furthermore, a third-party large language model was used to evaluate the framework's performance over diverse tasks, enhancing its reliability in practical applications. Cross-modal perception, knowledge retrieval, and logical reasoning were integrated to achieve high-precision tomato harvesting decision-making. The findings can provide effective technical support for robotic harvesting decision-making in facility-grown tomato cultivation.
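To make the knowledge retrieval and CoT-guided prompting stage more concrete, the sketch below shows a cosine-similarity top-k retrieval over precomputed embeddings (standing in for BGE-M3 dense vectors) followed by a simple CoT-style prompt assembly. The knowledge snippets, weather fields, function names, and prompt wording are placeholders for illustration, not the framework's actual content.

```python
import numpy as np

def top_k_retrieve(query_vec: np.ndarray, kb_vecs: np.ndarray,
                   kb_texts: list[str], k: int = 3) -> list[str]:
    """Return the k knowledge snippets whose embeddings are closest (cosine) to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    kb = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    scores = kb @ q
    return [kb_texts[i] for i in np.argsort(-scores)[:k]]


def build_cot_prompt(detections: list[str], knowledge: list[str], weather: dict) -> str:
    """Compose a chain-of-thought style prompt from visual, knowledge, and weather inputs."""
    return (
        "You are a tomato-harvesting decision assistant.\n"
        f"Detected fruits: {'; '.join(detections)}\n"
        f"Agronomic knowledge: {' '.join(knowledge)}\n"
        f"Environment: {weather}\n"
        "Reason step by step: (1) rank fruits by ripeness index, "
        "(2) check weather and agronomic constraints, (3) output the harvesting order."
    )


# Toy usage with random vectors in place of BGE-M3 embeddings.
rng = np.random.default_rng(0)
kb_texts = ["Harvest at the pink-to-ripe stage for long transport.",
            "Avoid harvesting right after irrigation to reduce cracking.",
            "Fully ripe fruits should be picked within 24 h."]
kb_vecs = rng.normal(size=(3, 8))
query_vec = kb_vecs[2] + 0.1 * rng.normal(size=8)   # pretend the query concerns ripe-fruit timing
snippets = top_k_retrieve(query_vec, kb_vecs, kb_texts, k=2)
prompt = build_cot_prompt(["upper-right fruit, ripeness 0.68", "middle-centre fruit, ripeness 0.93"],
                          snippets, {"temperature_C": 27, "humidity_pct": 65})
print(prompt)
```

In practice, the retrieved snippets and real-time meteorological fields would come from the vector database and platform APIs described above, and the assembled prompt would be passed to the DeepSeek-7B backbone for the final harvesting decision.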

     
