Abstract:
General-purpose large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they still face persistent challenges, including limited domain-specific knowledge, poor coverage of agricultural expertise, and high computational costs, which restrict their practical deployment in resource-constrained environments, particularly in the tea industry. This study developed ChatTea, a lightweight, domain-tailored intelligent question-answering system for the tea industry, to provide accurate, efficient, and cost-effective answers on critical topics such as tea cultivation, processing techniques, and pest and disease management practices. A high-quality tea-domain corpus was constructed to fine-tune the language model. Multiple authoritative sources were assembled and integrated, including more than ten expert-authored tea books, two hundred peer-reviewed scientific articles, and three hundred news reports published within the last five years, totaling over ten million Chinese characters. The dataset covered multiple subfields, including tea varieties, garden construction and maintenance, pest control strategies, and post-harvest processing. Multi-stage cleaning and validation were carried out to correct or remove inaccurate, irrelevant, and low-quality content, and terminological consistency was verified after processing. High-frequency word analysis and word cloud visualization confirmed the corpus's strong domain relevance and depth, highlighting key concepts such as “tea leaves,” “tea culture,” “quality control,” and “pest management.” To maximize effectiveness while maintaining high computational efficiency, ChatTea was built on the Meta-LLaMA-3.1-8B-Instruct base model and fine-tuned with Low-Rank Adaptation (LoRA). LoRA introduces two low-rank parameter matrices alongside each pre-trained weight matrix, and only these matrices are updated during training. This significantly reduces the number of tunable parameters, accelerating both training and inference and making the model suitable for deployment on edge devices in environments with limited computational resources. After LoRA fine-tuning on the domain-specific data, the model's knowledge base was effectively aligned with tea industry expertise, reducing the hallucinations commonly observed in generic LLMs. Experiments on a tea-domain test dataset showed that ChatTea outperformed the original base model and several mainstream open-source LLMs of comparable parameter size, such as Qwen-7B-chat, DeepSeek-7B, and XuanYuan-6B-chat. Specifically, ChatTea achieved a BLEU-4 score of 21.73%, ROUGE-1 of 43.68%, ROUGE-2 of 21.33%, and ROUGE-L of 37.26%, absolute improvements of 18.58%, 26.14%, 17.22%, and 27.69%, respectively, over the base model, demonstrating its ability to generate coherent, contextually relevant, and semantically accurate responses. Furthermore, ChatTea processed 18.40 inference steps/s, exceeding the base model's 2.90 steps/s and indicating enhanced computational efficiency. Compared with open-source models of similar size, ChatTea thus balanced high precision and speed in the specialized tea domain.
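As a rough illustration (not the authors' actual training code), the LoRA setup described above can be sketched with the Hugging Face peft library; the rank, scaling factor, and target modules below are illustrative assumptions rather than the paper's reported settings.

```python
# Minimal LoRA fine-tuning sketch; hyperparameters are assumed, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA places a pair of low-rank matrices (A, B) next to each frozen weight W,
# so the effective weight becomes W + (alpha / r) * B @ A; only A and B are trained.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update (assumed)
    lora_alpha=16,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters remain trainable
```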
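Likewise, the BLEU-4 and ROUGE-1/2/L scores of the kind reported above are typically computed as in the following sketch; the sentence pair is invented for illustration, and Chinese text would first require word or character segmentation.

```python
# Illustrative metric computation; the reference/candidate pair is a made-up example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "green tea should be withered briefly before pan firing"
candidate = "green tea leaves are withered briefly and then pan fired"

# BLEU-4: uniform weights over 1- to 4-gram precision, with smoothing for short texts.
bleu4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1/2/L F-measures between the reference and the generated answer.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU-4: {bleu4:.4f}")
for name, result in rouge.items():
    print(f"{name}: {result.fmeasure:.4f}")
```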
In addition to its strong quantitative performance, the architecture and training pipeline of ChatTea offer a replicable framework for large language models: combining expert-curated domain knowledge with parameter-efficient fine-tuning provides a scalable and cost-effective path toward domain-specific LLM development. The lightweight design of ChatTea facilitates deployment in real-world agricultural scenarios, including on-device applications and real-time interactive systems for precision agriculture and smart farming. In conclusion, ChatTea demonstrates that domain-focused corpus construction and efficient fine-tuning can effectively adapt large language models to specialized tasks. Its high accuracy, strong domain relevance, and fast inference can greatly contribute to intelligent question-answering systems in agriculture, particularly in the tea industry. In future work, multimodal data sources could be utilized to tailor language models for cross-domain applications.