Pesticide molecule generation model based on latent diffusion sampling and fragment heterogeneous graph
-
Graphical Abstract
-
Abstract
Pesticide molecular structure is one of the primary drivers of pest resistance in sustainable agriculture. Conventional computer-aided drug design often relies on restricted chemical libraries and human expertise, and it still falls short in exploring the extensive chemical space. Deep generative models are expected to have considerable potential in de novo design. However, existing frameworks are limited by suboptimal alignment with target-specific biological activities, while structural stability and sufficient chemical diversity are also required. In this study, a computational paradigm was developed that integrates multi-scale structural representation with latent space optimization to generate pesticide candidates with high chemical rationality and potent bioactivity. Specifically, a molecular generation model was developed that combines latent diffusion sampling with fragment heterogeneous graphs. A Heterogeneous Graph Neural Network (HGNN) was constructed to work in synergy with a Variational Autoencoder (VAE), so that atomic-level topology and fragment-level semantics were integrated into a unified latent space. Molecules were fragmented using the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm. The encoder used three Graph Convolutional Network (GCN) layers with a hidden dimension of 300 to process atomic, bond, pharmacophore, and reaction features. Subsequently, the Latent Diffusion Model (LDM) employed a one-dimensional U-Net architecture with six residual layers. A "noising-denoising" mechanism was implemented with 1,000 training steps, which effectively alleviated mode collapse. Finally, a Prefix-tuning strategy was integrated into a four-head Transformer-based decoder for conditional generation toward specific targets, including insect Acetylcholinesterase (AChE) and plant Acetolactate Synthase (ALS).
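As an illustration of the "noising" half of the noising-denoising mechanism, the following is a minimal sketch of the standard closed-form forward diffusion step applied to a latent vector. It is not the authors' implementation. The linear beta schedule, its endpoints, the reading of the 1,000 steps as diffusion timesteps, the latent size of 300, and all function names are assumptions made for illustration only.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t from q(z_t | z_0) in closed form; returns (z_t, eps)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
betas = linear_beta_schedule()          # 1,000 steps, assumed to be timesteps
alpha_bars = cumulative_alphas(betas)
z0 = [rng.gauss(0.0, 1.0) for _ in range(300)]  # stand-in for a VAE latent
z_mid, eps_mid = add_noise(z0, 499, alpha_bars, rng)
z_end, eps_end = add_noise(z0, 999, alpha_bars, rng)
```

In such a setup, a denoising network (here, the one-dimensional U-Net) would typically be trained to predict `eps` from `z_t` and `t`. By the final step the signal coefficient is close to zero, so sampling can start from pure Gaussian noise.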
Systematic evaluations demonstrated the superior performance of the optimized framework. In the AChE-targeted generation task, the generated molecules achieved a validity rate of 100.00%, a novelty rate of 100.00%, and a uniqueness rate of 98.50%. On the QM9 and ZINC benchmark datasets, the New/Sample metric reached 64.2% and 98.5%, respectively, significantly outperforming baselines such as MolGPT and GeoBFN. Ablation studies confirmed that the synergistic modeling of atom- and fragment-level views was essential to capture both fine-grained topology and high-order semantics. The distributions of physicochemical properties, including the Logarithm of Partition Coefficient (LogP), Topological Polar Surface Area (TPSA), and Molecular Weight (MW), were highly consistent with those of real-world pesticides. Molecular docking revealed that 62.81% of the candidates exhibited a binding affinity lower than -7.0 kcal/mol with the AChE protein (PDB: 6XYU). Furthermore, the candidates formed interactions with essential residues, such as Glutamic Acid 485 and Tyrosine 498, with hydrogen bond lengths ranging from 2.4 to 3.3 Å. Prefix-tuning required only 8,576 trainable parameters, significantly reducing the training time and the risk of overfitting. Integrating the multi-scale representation with latent diffusion enhanced molecular diversity, and the framework effectively captured target-specific structure-activity relationships while maintaining high chemical rationality. These findings can provide a scalable tool for targeted bioactive molecule design, helping to alleviate data scarcity and accelerate the discovery of environmentally friendly agrochemicals.
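To make the reported rates concrete, the sketch below computes validity, uniqueness, and novelty under commonly used definitions. These are assumed here, since the abstract does not define them: validity is the fraction of generated strings that parse, uniqueness is the fraction of valid molecules that are distinct, and novelty is the fraction of distinct valid molecules absent from the training set. The `new_per_sample` entry is one possible reading of the New/Sample metric and is likewise an assumption. The function names and the toy validity check are hypothetical.

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty of generated SMILES strings."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    n = len(generated)
    return {
        "validity": len(valid) / n if n else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
        # Assumed reading of "New/Sample": novel molecules per sample drawn.
        "new_per_sample": len(novel) / n if n else 0.0,
    }


def toy_is_valid(smiles):
    """Toy stand-in: non-empty with balanced parentheses. NOT a chemistry check."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


generated = ["CCO", "CCO", "CC(=O)O", "C(C", "c1ccccc1"]
training = ["CCO"]
metrics = generation_metrics(generated, training, toy_is_valid)
```

In practice, the validity predicate would be a real cheminformatics parser. For example, RDKit's `Chem.MolFromSmiles` returns `None` for strings it cannot parse. Strings should also be canonicalized before the set operations, so that two notations of one molecule are not counted as distinct.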
-
-