
Pesticide molecule generation model based on latent diffusion sampling and fragment heterogeneous graph

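  • Abstract: In recent years, deep generative models have shown great potential for pesticide discovery and de novo molecular design, but existing methods often produce fragmented molecular structures, offer insufficient diversity, and struggle to account for activity against specific targets. To address these problems, this study proposes a pesticide molecule generation model that fuses latent diffusion sampling with a fragment heterogeneous graph. First, a heterogeneous graph neural network is combined with a variational autoencoder to map the atom-level topology and fragment-level semantics of molecules into a unified latent space. Second, a latent diffusion model is introduced, whose iterative noising-denoising optimization mitigates the mode collapse of the generative model. Finally, a prefix-tuning strategy is used for targeted, optimized generation against two typical pesticide targets: insect acetylcholinesterase (AChE) and plant acetolactate synthase (ALS). The experimental results show that, in the AChE-targeted generation task, the validity, novelty, and uniqueness of the generated molecules reached 100.00%, 100.00%, and 98.50%, respectively. In addition, the distributions of physicochemical properties such as the logarithm of the partition coefficient (LogP), topological polar surface area (TPSA), and relative molecular mass (molecular weight, MW) were highly consistent with those of real pesticide molecules. Molecular docking showed that 62.81% of the generated molecules had a binding affinity below −7.0 kcal/mol for the AChE target protein (PDB: 6XYU) and successfully reproduced the interaction patterns with key amino acid residues such as GLU-485 and TYR-498. The method can efficiently generate candidate pesticide molecules with reasonable structures, favorable properties, and potential biological activity, providing a computational paradigm for overcoming the data-scarcity bottleneck in pesticide research and development and for accelerating the creation of new pesticides.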
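The validity, novelty, and uniqueness rates reported for the AChE task can be illustrated with a minimal sketch. The abstract does not state the exact definitions used, so the ones in the docstring are assumptions (conventions differ between benchmarks). The names `generation_metrics` and `is_valid` and the toy strings are hypothetical and introduced only for illustration.

```python
from typing import Callable, Iterable


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> dict:
    """Return validity, uniqueness, and novelty as fractions in [0, 1].

    Assumed definitions (not taken from the source):
      validity   = valid molecules / all generated molecules
      uniqueness = distinct valid molecules / valid molecules
      novelty    = distinct valid molecules absent from the training set
                   / distinct valid molecules
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy example. The strings stand in for canonical SMILES, and the validity
# check (non-empty string) is only a placeholder; a real pipeline would parse
# and canonicalize each string with a cheminformatics toolkit before comparing.
toy_generated = ["CCO", "CCO", "CCN", "c1ccccc1", ""]
toy_training = ["CCN"]
metrics = generation_metrics(toy_generated, toy_training, is_valid=lambda s: bool(s))
print(metrics)
```

Passing the validity check in as a function keeps the sketch independent of any particular toolkit; string comparison is only meaningful once every molecule is written in one canonical form.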

     

    Abstract: Pesticide molecular structure is one of the primary drivers of pest resistance in sustainable agriculture. Conventional computer-aided drug design often relies on restricted chemical libraries and human expertise, and it explores only a small part of the extensive chemical space. Deep generative models show considerable potential for de novo design. However, existing frameworks often align poorly with target-specific biological activity and struggle to deliver structural stability and sufficient chemical diversity. In this study, a computational paradigm was developed that integrates multi-scale structural representation with latent space optimization, with the aim of generating pesticide candidates that combine high chemical rationality with potent bioactivity. The proposed molecular generation model integrates latent diffusion sampling with fragment heterogeneous graphs. A Heterogeneous Graph Neural Network (HGNN) was combined with a Variational Autoencoder (VAE) to map atomic-level topology and fragment-level semantics into a unified latent space. Molecules were fragmented with the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm. The encoder used three Graph Convolutional Network (GCN) layers with a hidden dimension of 300 to process atom, bond, pharmacophore, and reaction features. Subsequently, the Latent Diffusion Model (LDM) employed a one-dimensional U-Net with six residual layers. Its "noising-denoising" mechanism, implemented with 1,000 training steps, effectively alleviated mode collapse. Finally, a prefix-tuning strategy was integrated into a four-head Transformer-based decoder for conditional generation toward two specific targets, insect acetylcholinesterase (AChE) and plant acetolactate synthase (ALS). Systematic evaluation demonstrated the superior performance of the framework after optimization. In the AChE-targeted generation task, the generated molecules achieved a validity of 100.00%, a novelty of 100.00%, and a uniqueness of 98.50%. On the QM9 and ZINC benchmark datasets, the model reached New/Sample scores of 64.2% and 98.5%, respectively, significantly outperforming baselines such as MolGPT and GeoBFN. Ablation studies confirmed that jointly modeling the atom-level and fragment-level views was essential for capturing fine-grained topology and high-order semantics. The distributions of physicochemical properties, including the logarithm of the partition coefficient (LogP), topological polar surface area (TPSA), and molecular weight (MW), were highly consistent with those of real-world pesticides. Molecular docking showed that 62.81% of the candidates had a binding affinity below -7.0 kcal/mol for the AChE protein (PDB: 6XYU). Furthermore, the candidates reproduced interaction patterns with key residues such as glutamic acid 485 and tyrosine 498, with hydrogen bond lengths from 2.4 to 3.3 Å. Prefix-tuning required only 8,576 trainable parameters, significantly reducing training time and overfitting. Integrating multi-scale representation with latent diffusion enhanced molecular diversity, and the framework effectively captured target-specific structure-activity relationships while maintaining high chemical rationality. The framework provides a scalable tool for targeted bioactive molecule design and can help alleviate data scarcity and accelerate the discovery of environmentally friendly agrochemicals.
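The "noising" half of the noising-denoising mechanism can be sketched as the standard closed-form forward process of a denoising diffusion model applied to a latent vector. This is an illustration under stated assumptions, not the authors' implementation: it reads the 1,000 steps as diffusion timesteps, and the linear schedule, its endpoints (1e-4 to 0.02), the 8-dimensional latent, and all function names are hypothetical choices that the abstract does not specify.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(z0, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps.

    t is a 0-based index into alpha_bars. Returns (z_t, eps), where eps is
    the Gaussian noise that a denoising network would be trained to predict.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [signal * z + sigma * e for z, e in zip(z0, eps)]
    return z_t, eps


NUM_STEPS = 1000  # assumed to correspond to the 1,000 steps in the abstract
alpha_bars = cumulative_alphas(linear_beta_schedule(NUM_STEPS))

rng = random.Random(0)
z0 = [1.0] * 8  # hypothetical 8-dimensional latent code from the encoder
z_early, _ = noise_latent(z0, 0, alpha_bars, rng)
z_late, _ = noise_latent(z0, NUM_STEPS - 1, alpha_bars, rng)
print(f"signal scale at first step: {math.sqrt(alpha_bars[0]):.4f}")
print(f"signal scale at last step:  {math.sqrt(alpha_bars[-1]):.4f}")
```

Because the signal scale shrinks toward zero as the step index grows, the latent at the final step is close to pure Gaussian noise. Sampling would run the learned denoiser in the reverse direction, which is omitted here.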

     
