Abstract:
General-purpose large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they still face persistent challenges, including limited domain-specific knowledge, poor coverage of agricultural expertise, and high computational costs, which restrict their practical deployment in resource-constrained environments, particularly in the tea industry. This study developed ChatTea, a lightweight, domain-tailored intelligent question-answering system for the tea industry, to provide accurate, efficient, and cost-effective answers on critical topics such as tea cultivation, processing techniques, and pest and disease management practices. A high-quality tea-domain corpus was constructed to fine-tune the language model. Multiple authoritative sources were assembled and integrated, including more than ten expert-authored tea books, two hundred peer-reviewed scientific articles, and three hundred news reports published within the last five years, totaling over ten million Chinese characters. The dataset covered multiple subfields, including tea varieties, garden construction and maintenance, pest control strategies, and post-harvest processing. Multi-stage cleaning and validation were carried out to correct or remove inaccurate, irrelevant, and low-quality content, and terminological consistency was verified after processing. High-frequency word analysis and word cloud visualization confirmed the corpus's strong domain relevance and depth, highlighting key concepts such as “tea leaves,” “tea culture,” “quality control,” and “pest management.” To maximize effectiveness while maintaining high computational efficiency, ChatTea was built on the Meta-LLaMA-3.1-8B-Instruct base model and fine-tuned with Low-Rank Adaptation (LoRA). LoRA introduces two low-rank parameter matrices alongside each pre-trained weight matrix, and only these matrices are updated during training. This significantly reduces the number of tunable parameters, accelerating both training and inference and making the model suitable for deployment on edge devices in environments with limited computational resources. After LoRA fine-tuning on the domain-specific data, the model's knowledge base was effectively aligned with tea industry expertise, reducing the hallucinations commonly observed in generic LLMs. Experiments on a tea-domain test dataset showed that ChatTea outperformed the original base model and several mainstream open-source LLMs of comparable parameter size, such as Qwen-7B-chat, DeepSeek-7B, and XuanYuan-6B-chat. Specifically, ChatTea achieved a BLEU-4 score of 21.73%, ROUGE-1 of 43.68%, ROUGE-2 of 21.33%, and ROUGE-L of 37.26%, absolute improvements of 18.58%, 26.14%, 17.22%, and 27.69%, respectively, over the base model, demonstrating its ability to generate coherent, contextually relevant, and semantically accurate responses. Furthermore, ChatTea processed 18.40 inference steps/s, exceeding the base model's 2.90 steps/s and indicating enhanced computational efficiency. Compared with open-source models of similar size, ChatTea thus balanced high precision and speed in the specialized tea domain.
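As a rough illustration (not the authors' actual training code), the LoRA setup described above can be sketched with the Hugging Face peft library; the rank, scaling factor, and target modules below are illustrative assumptions rather than the paper's reported settings.

```python
# Minimal LoRA fine-tuning sketch; hyperparameters are assumed, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA places a pair of low-rank matrices (A, B) next to each frozen weight W,
# so the effective weight becomes W + (alpha / r) * B @ A; only A and B are trained.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update (assumed)
    lora_alpha=16,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters remain trainable
```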
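Likewise, the BLEU-4 and ROUGE-1/2/L scores of the kind reported above are typically computed as in the following sketch; the sentence pair is invented for illustration, and Chinese text would first require word or character segmentation.

```python
# Illustrative metric computation; the reference/candidate pair is a made-up example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "green tea should be withered briefly before pan firing"
candidate = "green tea leaves are withered briefly and then pan fired"

# BLEU-4: uniform weights over 1- to 4-gram precision, with smoothing for short texts.
bleu4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1/2/L F-measures between the reference and the generated answer.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU-4: {bleu4:.4f}")
for name, result in rouge.items():
    print(f"{name}: {result.fmeasure:.4f}")
```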
In addition to its strong quantitative performance, the architecture and training pipeline of ChatTea offer a replicable framework for large language models: combining expert-curated domain knowledge with parameter-efficient fine-tuning provides a scalable and cost-effective path toward domain-specific LLM development. The lightweight design of ChatTea facilitates deployment in real-world agricultural scenarios, including on-device applications and real-time interactive systems for precision agriculture and smart farming. In conclusion, ChatTea demonstrates that domain-focused corpus construction and efficient fine-tuning can effectively adapt large language models to specialized tasks. Its high accuracy, strong domain relevance, and fast inference can greatly contribute to intelligent question-answering systems in agriculture, particularly in the tea industry. In future work, multimodal data sources could be utilized to tailor language models for cross-domain applications.