色谱 (Chinese Journal of Chromatography), 2025, Vol. 43, Issue (8): 959-970. DOI: 10.3724/SP.J.1123.2024.12008

• Research Article •

新污染物诱导12种细胞核受体相关活性的机器学习预测模型

李建青1,#, 王天勤1,#, 滕跃发2, 郭磊1, 黄杨1,*, 李斐2,*

  1. 鲁东大学化学与材料科学学院,山东 烟台 264025
    2.中国科学院海岸带环境过程与生态修复重点实验室(烟台海岸带研究所),山东省海岸带环境过程重点实验室,中国科学院烟台海岸带研究所,山东 烟台 264003
  • Received: 2024-12-13 Publication date: 2025-08-08 Release date: 2025-07-28
  • Corresponding authors: E-mail: huangyang@ldu.edu.cn (HUANG Yang); E-mail: fli@yic.ac.cn (LI Fei).
  • Author note: First contact: # co-first authors
  • 基金资助:
    国家自然科学基金(22406080);国家自然科学基金(22376215);泰山学者工程(tsqn202312275);山东省自然科学基金(ZR2024QB094);山东省科技型中小企业创新能力提升项目(2024TSGC0504);山东省大学生创新创业训练计划项目(S202410451034)

Machine learning prediction model for emerging pollutants-induced activities of 12 nuclear receptors

LI Jianqing1,#, WANG Tianqin1,#, TENG Yuefa2, GUO Lei1, HUANG Yang1,*, LI Fei2,*

  1. College of Chemistry and Material Science, Ludong University, Yantai 264025, China
    2. Key Laboratory of Coastal Environmental Processes and Ecological Restoration, Chinese Academy of Sciences (Yantai Institute of Coastal Research), Key Laboratory of Coastal Environmental Processes of Shandong Province, Yantai Institute of Coastal Research, Chinese Academy of Sciences, Yantai 264003, China
  • Received: 2024-12-13 Online: 2025-08-08 Published: 2025-07-28
  • Supported by:
    National Natural Science Foundation of China (22406080); National Natural Science Foundation of China (22376215); Taishan Scholars Program (tsqn202312275); Natural Science Foundation of Shandong Province (ZR2024QB094); Shandong Province Science and Technology Small and Medium-sized Enterprise Innovation Capability Enhancement Project (2024TSGC0504); Shandong Provincial College Student Innovation and Entrepreneurship Training Program (S202410451034)

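Abstract:

Synthetic compounds are widely used in production and daily life and inevitably enter the environment as potential emerging pollutants, where they come into contact with humans and endanger human health. Preventing and controlling the health hazards of potential emerging pollutants requires a comprehensive toxicity assessment of compounds that have entered or are about to enter the market. Experiment-based toxicity assessment is far slower than the rate at which new compounds enter the market; moreover, traditional toxicity experiments are costly in time and money and give rise to disputes between the results of different laboratories, leading to inconsistent toxicity-screening standards. High-throughput toxicity prediction models based on artificial intelligence and machine-learning standards are therefore urgently needed to efficiently fill gaps in compound toxicity data. In this study, machine-learning methods were used to predict the toxicity of the various classes of compounds in the Tox21 database. Compound structures were represented in the simplified molecular input line entry system (SMILES) format, and information characterizing physicochemical properties and experimental conditions was encoded as descriptors using the RDKit and Mordred libraries. The information gain of each variable was calculated and screened with Python's Sklearn and XGBoost libraries to obtain a new feature set, on which toxicity prediction models were built to accurately predict 12 classes of nuclear receptor-related activity indicators. The model achieved an average area under the receiver operating characteristic curve (AUC) of 0.84 across the 12 datasets, and all training and test set data fell within the model's applicability domain. External validation showed that the model constructed in this study outperformed the other models in the Tox21 challenge. Model parameters were analyzed with the SHAP algorithm to explain the toxicity mechanism, revealing that descriptors such as log P, molecular topology, ZMIC, and piPC are the main factors influencing activity. To make the model accessible to researchers and policymakers from different disciplinary backgrounds, it was developed into visual software that accepts compound structures in SMILES format and predicts their toxicity. The prediction model and accompanying software developed in this study can rapidly screen the toxicity of emerging pollutants and provide guidance for the safe design of new chemicals.

Key words: emerging pollutants, quantitative structure-activity relationship, machine learning, biological effects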

Abstract:

Emerging pollutants are substances that have recently been discovered or brought into focus, pose ecological or human-health risks, and either have not yet been included in regulatory frameworks or are inadequately controlled by existing management measures. Synthetic chemicals play key roles in the progress of human society and in improving quality of life. However, these chemicals may leak into the environment through unintentional or organized emissions during the life cycles of chemical-containing products, thereby becoming potential emerging pollutants and posing ecological and human-health threats. Many new chemicals are used without sufficient toxicity assessment; consequently, their potential threats are difficult to predict. Effective toxicity assessments of existing and emerging chemicals are therefore required. Experimentally testing the toxicity of every chemical is very time-consuming and costly. In addition, discrepancies between experimental results from different laboratories lead to inconsistent toxicity-screening standards for emerging pollutants, which hinders both their prevention and control and the explanation of their toxicity mechanisms. Addressing these issues requires standardized alternative toxicity-testing strategies that screen emerging pollutants in a high-throughput manner. In this study, machine-learning methods were used to predict the toxicities of various compounds in the Tox21 database. The RDKit and Mordred libraries were used to process compound structural data (presented in simplified molecular input line entry system (SMILES) format) and generate molecular descriptors of physicochemical properties. A refined feature set was obtained through information-gain calculations and variable selection, and the data were fitted using Python's Sklearn and XGBoost libraries.
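The information-gain screening mentioned above can be illustrated with a small standard-library sketch. It is not the authors' Sklearn/XGBoost pipeline: the descriptor names (`logp_bin`, `ring_bin`), the pre-binned values, the activity labels, and the 0.1 threshold are all hypothetical, and the gain is computed directly from Shannon entropy.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy reduction in `labels` after splitting on a discrete `feature`."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(columns, labels, threshold):
    """Return the names of descriptor columns whose gain exceeds `threshold`."""
    return [name for name, column in columns.items()
            if information_gain(column, labels) > threshold]


# Hypothetical toy data: 1 = active, 0 = inactive; descriptors already binned.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
columns = {
    "logp_bin": ["high", "high", "high", "low", "low", "low", "high", "low"],
    "ring_bin": ["a", "b", "a", "b", "a", "b", "b", "a"],
}
selected = select_features(columns, labels, threshold=0.1)
print(selected)
```

In this toy set the first descriptor separates actives from inactives completely, while the second carries no information about the label, so only the first survives the screen. Continuous descriptors would need binning (or a mutual-information estimator) before a calculation of this kind applies.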
Prediction models were constructed from the screened features using seven machine-learning algorithms to evaluate 12 bioactivity endpoints, including datasets related to endocrine disruption, DNA damage, and oxidative stress response, among others. Model performance was evaluated by calculating accuracy on the test set, and the applicability of the models was characterized in terms of the applicability domain. All training and test data were located within the applicability domain. The models predicted all 12 endpoints with high accuracy. This study clarified the relationship between the physicochemical properties of chemicals and nuclear receptor activity, and developed corresponding software tools. Across the 12 Tox21 datasets, the model achieved an average area under the receiver operating characteristic curve (AUC) of 0.84 and outperformed the other models that participated in the Tox21 challenge. Further insight into toxicological mechanisms was obtained through feature-importance analysis using SHapley Additive exPlanations (SHAP). The octanol-water partition coefficient (log P), molecular topology, and the ZMIC and piPC descriptors were identified as key parameters for predicting toxicity; these descriptors elucidate the relationship between chemical structure and biological interaction, thereby providing mechanistic explanations for compound toxicities. For example, high log P values are associated with high cell membrane permeability, which facilitates interactions with intracellular targets and endocrine receptors. The study also developed user-friendly quantitative structure-activity relationship (QSAR) prediction software. Designed for accessibility, this software enables researchers and policymakers to input compound structures in SMILES format and predict their toxicities without specialized machine-learning expertise. The software automatically generates descriptors and predicts whether the input compounds are toxic.
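As a minimal sketch of how an average AUC over several endpoints is obtained, the code below computes the area under the ROC curve as the fraction of active/inactive pairs that the model ranks correctly, then averages over endpoints. The endpoint names, labels, and scores are hypothetical placeholders and are not data from the study; in practice a library routine would normally be used.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random active outranks a random inactive.

    Tied scores count as half a correct ranking.
    """
    actives = [s for y, s in zip(y_true, y_score) if y == 1]
    inactives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one active and one inactive compound")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


def mean_auc(per_endpoint):
    """Average AUC over a mapping of endpoint name -> (y_true, y_score)."""
    aucs = {name: roc_auc(y, s) for name, (y, s) in per_endpoint.items()}
    return sum(aucs.values()) / len(aucs), aucs


# Hypothetical predictions for two endpoints (1 = active, 0 = inactive).
per_endpoint = {
    "endpoint_A": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "endpoint_B": ([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.1]),
}
average, aucs = mean_auc(per_endpoint)
print(aucs, average)
```

The pairwise form is quadratic in the number of compounds, which is acceptable for a toy example; rank-based formulas give the same value more efficiently on large screening sets.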
By integrating advanced machine-learning and interpretation methods, this study contributes to in silico approaches that can replace animal testing in future toxicity studies. The predictive model and accompanying software enable the rapid screening of emerging pollutants and provide guidance for designing safer chemicals. These contributions are critical for advancing environmental safety and public health in the face of expanding chemical inventories.
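The Shapley-value attribution that underlies the SHAP analysis mentioned in the abstract can be sketched by brute force for a tiny model. This is an illustration only: the scoring function `toy_score`, its coefficients, the three descriptor values, and the baseline are hypothetical, and the SHAP library itself relies on faster model-specific algorithms rather than full enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature coalitions.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_score(features):
    """Hypothetical activity score with an interaction between two descriptors."""
    logp, topology, charge = features
    return 0.5 * logp + 0.2 * topology + 0.1 * logp * topology + 0.0 * charge


x = [4.0, 2.0, 1.0]         # hypothetical compound: log P, topology index, charge
baseline = [1.0, 1.0, 0.0]  # hypothetical reference compound
phi = shapley_values(toy_score, x, baseline)
print(phi)
```

The attributions sum to the difference between the score of the compound and the score of the baseline, and a descriptor the model ignores receives zero, which is what makes this kind of decomposition usable for ranking descriptors such as log P by their contribution to a prediction.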

Key words: emerging pollutants, quantitative structure-activity relationship (QSAR), machine learning, biological effects

CLC number: