Chinese Journal of Chromatography, 2025, Vol. 43, Issue (4): 355-362. DOI: 10.3724/SP.J.1123.2024.07014

• Research Article •

Construction of a machine learning ensemble prediction model for gas chromatographic retention index on stationary phases with different polarities

WANG Qianyi, ZHU Yongle, LI Xuehua*

  1. Key Laboratory of Industrial Ecology and Environmental Engineering, Ministry of Education, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China
  • Received: 2024-07-21 Published: 2025-04-08 Online: 2025-03-26
  • Corresponding author: *Tel: (0411)84706913, E-mail: lixuehua@dlut.edu.cn.

Abstract:

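The retention index is a measure used in chromatographic analysis to characterize the retention behavior of compounds and is an important parameter for structural identification. Because the retention index of a compound differs across stationary phases of different polarity, current prediction models built on a stationary phase of a single polarity cannot be applied effectively to stationary phases of multiple polarities. This study therefore established a model for predicting gas chromatographic retention indices on stationary phases of different polarity. A total of 4183 retention-index records for 2499 compounds on eight types of stationary phase were collected from the literature, and the stationary phases were further divided into five classes according to their McReynolds constants: strongly polar, polar, medium polar, weakly polar, and non-polar. Molecular structural features of the compounds were coupled with one-hot-encoded stationary-phase polarity features as the model input, and machine learning prediction models were built with nine algorithms. Based on XGBoost and LightGBM, the two best-performing algorithms, an ensemble learning model was built by voting regression. Its training-set coefficient of determination (R²) was 0.99 and its training-set root mean square error (RMSE) was 101.85; its test-set R² was 0.97 and its test-set RMSE was 107.44. Williams plots were used to characterize the applicability domain of the model, and more than 94% of the data fell within the domain. By combining stationary-phase polarity with compound structure as composite features, this study developed a retention-index prediction model that accommodates stationary phases of multiple polarities, overcoming the limitation of existing single-polarity models and greatly broadening the range of application. Compared with the individual machine learning models, the ensemble model showed better robustness and predictive ability. The model is of scientific and practical value for improving the efficiency and accuracy of target and non-target gas chromatographic analysis.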
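The voting-regression step and the two reported metrics can be illustrated with a minimal sketch. It assumes an unweighted average of the base models' predictions (the abstract does not state the weights), and all retention-index values and the names `voting_predict`, `model_a`, and `model_b` are hypothetical, chosen only for illustration.

```python
from math import sqrt


def voting_predict(predictions_per_model, weights=None):
    """Combine per-sample predictions of several regressors by (weighted) averaging."""
    n_models = len(predictions_per_model)
    if weights is None:
        weights = [1.0] * n_models  # assumption: equal weights
    total = sum(weights)
    n_samples = len(predictions_per_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions_per_model)) / total
        for i in range(n_samples)
    ]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the retention index."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical retention indices and base-model predictions (not data from the paper).
y_true = [800.0, 1000.0, 1200.0, 1400.0]
model_a = [790.0, 1020.0, 1180.0, 1430.0]  # stands in for one base model
model_b = [820.0, 990.0, 1210.0, 1390.0]   # stands in for the other base model

ensemble = voting_predict([model_a, model_b])
print("ensemble predictions:", ensemble)
print("R2:", r2_score(y_true, ensemble), "RMSE:", rmse(y_true, ensemble))
```

In this made-up example the errors of the two base models partly cancel, so the averaged prediction has a lower RMSE than either one; that is not guaranteed in general and depends on how correlated the base models' errors are.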

Keywords: gas chromatographic retention index, ensemble learning, stationary phases of different polarity, McReynolds constant

Abstract:

Gas chromatography is an analytical technique widely used to separate and identify compounds. The retention index (RI) provides a standardized measure of a compound's retention behavior under specific conditions and is a powerful tool for compound identification, particularly in complex mixtures. Because RI values vary significantly across stationary phases of different polarity, predicting RI for phases of more than one polarity is a meaningful objective. To address this issue, we developed a model for predicting gas-chromatographic RIs on stationary phases of varying polarity, using 4183 RI data points for 2499 compounds on eight types of stationary phase collected from the literature and databases. The stationary phases were further classified into five categories based on their McReynolds constants: strongly polar, polar, medium polar, weakly polar, and non-polar. This classification allows the model to handle a wide range of polarities, which broadens its applicability to different analytical scenarios. The model input combined two types of feature. First, 1D and 2D molecular-structural descriptors were calculated for each compound; these capture chemical and physical properties such as relative molecular mass, functional groups, and topological indices, and describe the molecular characteristics that influence retention behavior. Second, stationary-phase polarity was one-hot encoded, which converts the categorical polarity class into a numerical format that machine-learning algorithms can use.
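A minimal sketch of how the two feature types might be combined into one input row is given here. The five class names come from the abstract; the function names, the ordering of the classes, and the three-element descriptor vector (including the placeholder topological-index value) are assumptions made for illustration, not the descriptor set used in the study.

```python
# Polarity classes named in the abstract; the ordering is an arbitrary choice here.
POLARITY_CLASSES = ["strongly polar", "polar", "medium polar", "weakly polar", "non-polar"]


def one_hot_polarity(label):
    """Return a 5-element vector with a single 1.0 at the position of `label`."""
    if label not in POLARITY_CLASSES:
        raise ValueError(f"unknown polarity class: {label!r}")
    return [1.0 if label == cls else 0.0 for cls in POLARITY_CLASSES]


def build_feature_vector(descriptors, polarity_label):
    """Concatenate molecular descriptors with the one-hot polarity vector."""
    return list(descriptors) + one_hot_polarity(polarity_label)


# Hypothetical descriptor vector for one compound:
# [relative molecular mass, hydroxyl-group count, placeholder topological index]
ethanol_descriptors = [46.07, 1.0, 1.63]

# The same compound yields a different input row for each stationary-phase class.
row_polar = build_feature_vector(ethanol_descriptors, "polar")
row_nonpolar = build_feature_vector(ethanol_descriptors, "non-polar")
print(row_polar)
print(row_nonpolar)
```

Because the molecular descriptors are identical in both rows, any difference in the predicted RI for the two rows can only come from the polarity columns.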
This encoding allows the model to distinguish the effects of different polarities on the retention behavior of the compounds. Nine algorithms were used to construct predictive machine-learning models, including linear regression, decision tree, random forest, support vector machine (SVM), k-nearest neighbor (KNN), gradient-boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient-boosting machine (LightGBM). Voting regression was then used to build an ensemble learning model from XGBoost and LightGBM, the two best-performing algorithms. The ensemble model achieved a training-set coefficient of determination (R²) of 0.99, a training-set root mean square error (RMSE) of 101.85, a test-set R² of 0.97, and a test-set RMSE of 107.44. Williams plots were used to characterize the applicability domain of the model; more than 94% of the data lay within the domain, which indicates broad applicability and supports confidence in the predictions. By combining stationary-phase polarity with molecular-structure features, the model predicts RI values across stationary phases of widely differing polarity and overcomes the limitation of existing single-polarity models. The ensemble model is more robust and has better predictive ability than the individual machine-learning models. The model is of scientific and practical value for improving the efficiency and accuracy of target and non-target gas-chromatographic analyses.
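The Williams-plot check can be sketched as follows. A Williams plot shows standardized residuals against leverage; the sketch assumes the commonly used conventions of a warning leverage h* = 3(p + 1)/n and a residual limit of ±3, neither of which is stated in the abstract. The function names and the single-descriptor data are hypothetical.

```python
from math import sqrt


def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def leverages(train_rows, query_rows):
    """Leverage h = x^T (X^T X)^-1 x, with an intercept column added to each row."""
    X = [[1.0] + list(r) for r in train_rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    result = []
    for q in query_rows:
        x = [1.0] + list(q)
        z = solve_linear(xtx, x)  # z = (X^T X)^-1 x
        result.append(sum(xi * zi for xi, zi in zip(x, z)))
    return result


def standardized_residuals(y_true, y_pred):
    """Residuals divided by their sample standard deviation (one common convention)."""
    res = [t - p for t, p in zip(y_true, y_pred)]
    mean = sum(res) / len(res)
    sd = sqrt(sum((r - mean) ** 2 for r in res) / (len(res) - 1))
    return [r / sd for r in res]


def in_domain(h, std_resid, h_star, resid_limit=3.0):
    """Inside the domain if leverage and |standardized residual| are both within limits."""
    return h <= h_star and abs(std_resid) <= resid_limit


# Hypothetical training set with a single descriptor (p = 1, n = 5).
train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
h_star = 3 * (1 + 1) / len(train)            # assumed warning leverage, 3(p + 1)/n
h_train = leverages(train, train)
h_query = leverages(train, [[10.0]])[0]      # a compound far outside the training range
print("training leverages:", h_train)
print("query leverage:", h_query, "warning leverage:", h_star)
```

A compound whose descriptor lies far from the training data gets a leverage above h*, so its prediction would be treated as an extrapolation even if its residual happens to be small.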

Key words: gas-chromatographic retention index, ensemble learning, stationary phases of different polarity, McReynolds constant

CLC number: