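Construction of a machine learning ensemble prediction model for gas chromatographic retention index on stationary phases with different polarities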

doi:10.3724/SP.J.1123.2024.07014

Summary/Abstract

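Summary:

The retention index is a measure used in chromatographic analysis to characterize the retention behavior of compounds and is an important parameter for identifying compound structures. Because the retention index of a compound differs across stationary phases of different polarity, current prediction models built on a single-polarity stationary phase cannot be applied effectively to stationary phases of multiple polarities. This study therefore established a model for predicting gas-chromatographic retention indices on stationary phases of different polarity. A total of 4183 retention-index data points for 2499 compounds on eight types of stationary phase were collected from the literature, and the stationary phases were further divided into five classes according to their McReynolds constants: strongly polar, polar, medium polar, weakly polar, and non-polar. Molecular-structural features of the compounds were combined with one-hot-encoded stationary-phase polarity features as model inputs, and machine-learning prediction models were built with nine algorithms. Based on the best-performing XGBoost and LightGBM algorithms, an ensemble learning model was built by voting regression; its training-set coefficient of determination (R²) was 0.99, its training-set root mean square error (RMSE) was 101.85, its test-set R² was 0.97, and its test-set RMSE was 107.44. A Williams plot was used to characterize the application domain of the model, and more than 94% of the data fell within the domain. By combining two kinds of features, stationary-phase polarity and compound structure, this study developed a retention-index prediction model that accommodates stationary phases of multiple polarities, overcoming the limitation of existing single-polarity models and greatly broadening the range of application. Compared with the individual machine-learning models, the ensemble model showed better robustness and predictive ability. The model has scientific significance and practical value for improving the efficiency and accuracy of target and non-target gas-chromatographic analyses.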

Keywords: gas-chromatographic retention index, ensemble learning, stationary phases of different polarity, McReynolds constant

Abstract:

Gas chromatography is an analytical technique widely used to separate and identify compounds. The retention index (RI) plays a significant role in gas chromatography because it provides a standardized measure of the retention behavior of a compound under specific conditions and is a powerful compound-identification tool, particularly for complex mixtures. Predicting RI values is therefore a meaningful objective, and it is especially challenging across stationary phases of different polarity because the RI of a compound varies significantly with stationary-phase polarity. To address this issue, we developed a model for predicting gas-chromatographic RIs on stationary phases of varying polarity by collecting 4183 retention-index data points for 2499 compounds on eight types of stationary phase from the literature and databases. The stationary phases were further classified into five categories based on their McReynolds constants, namely strongly polar, polar, medium polar, weakly polar, and non-polar. This classification allows the model to handle a wide range of polarities, which enhances its versatility and applicability to various analytical scenarios. The predictive model was constructed by integrating two types of feature. The 1D and 2D molecular-structural features of the compounds were first determined; these features capture the chemical and physical properties of the compounds, including their relative molecular masses, functional groups, and topological indices, and thereby describe the molecular characteristics that influence retention behavior. Stationary-phase polarity was then one-hot encoded, which converts the categorical polarity information into a numerical format that machine-learning algorithms can use.
This encoding technique allows the model to distinguish the effects of the different polarities on the retention behavior of the compounds. Nine algorithms were used to construct predictive machine-learning models: linear regression, decision tree, random forest, support vector machine (SVM), k-nearest-neighbor (KNN), gradient-boosting decision tree (GBDT), extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), and light gradient-boosting machine (LightGBM). Voting regression was then used to build an ensemble learning model from the best-performing XGBoost and LightGBM algorithms. This ensemble model, which combines the strengths of its individual base models, achieved a training-set coefficient of determination (R²) of 0.99, a training-set root mean square error (RMSE) of 101.85, a test-set R² of 0.97, and a test-set RMSE of 107.44. Williams plots were used to characterize the application domain of the model; over 94% of the data lie within the domain, which indicates broad applicability and high predictive confidence. By combining machine-learning techniques with chemical- and physical-property data, the model predicts RI values accurately across stationary phases spanning a wide range of polarity. The ensemble model is more robust and has better predictive ability than the individual machine-learning models. The model is of scientific significance and practical value for improving the efficiency and accuracy of target and non-target gas-chromatographic analyses.
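As a minimal sketch of the voting-regression idea described above, the Python snippet below averages the predictions of two base models with equal weights and scores the result with RMSE and R². The two base models are hypothetical stand-in functions, not fitted XGBoost or LightGBM models, and the descriptor values and experimental RIs are made up for illustration; equal weighting is also an assumption, since the abstract does not state the weights used.

```python
import math


def voting_predict(models, x):
    """Voting regression with equal weights: average the base-model predictions."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical stand-ins for two fitted base models (assumption, not real fits).
def model_a(x):
    return 100.0 * x[0] + 10.0


def model_b(x):
    return 100.0 * x[0] - 10.0


# Made-up inputs (one descriptor per compound) and made-up experimental RIs.
inputs = [[8.0], [10.0], [12.0]]
experimental = [800.0, 1000.0, 1200.0]
predicted = [voting_predict([model_a, model_b], x) for x in inputs]

print(predicted)                           # [800.0, 1000.0, 1200.0]
print(rmse(experimental, predicted))       # 0.0
print(r_squared(experimental, predicted))  # 1.0
```

In this toy case the two base models err in opposite directions, so the average cancels their errors exactly; real base models only partly offset each other. Libraries such as scikit-learn provide the same averaging through `VotingRegressor`.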

Key words: gas-chromatographic retention index, ensemble learning, stationary phases of different polarity, McReynolds constant

CLC number: O658

WANG Qianyi, ZHU Yongle, LI Xuehua. Construction of a machine learning ensemble prediction model for gas chromatographic retention index on stationary phases with different polarities[J]. Chinese Journal of Chromatography, 2025, 43(4): 355-362.

Figures/Tables (11)

Table 1 Sources and distribution of the 4237 modeling data points

Category	Number	Chromatographic column	Stationary phase	McReynolds constant	Refs.
Strongly polar	1372	carbowax-20M	polyethylene glycol	462	[14-18]
Polar	198	DB-225MS	50% cyanopropylphenyl-50% dimethylpolysiloxane	363*	[19]
Medium polar	484	DB-624	6% cyanopropylphenyl-94% dimethylpolysiloxane	158*	[20]
		OV17	50% diphenyl-50% dimethylpolysiloxane	177	[14]
Weakly polar	1316	DB-5	5% diphenyl-95% dimethylpolysiloxane	67	[14,21]
		HP5-MS	5% diphenyl-95% dimethylpolysiloxane	67	[19]
		-	5% diphenyl-95% dimethylpolysiloxane	67	[15,16]
Non-polar	817	HP-1	100% dimethylpolysiloxane	44	[19]
		OV101	100% dimethylpolysiloxane	44	[14]
		-	100% dimethylpolysiloxane	44	[15,16,22]
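As a minimal sketch of how the polarity classes in Table 1 could become model inputs, the Python snippet below looks up the polarity class of a column and appends its one-hot encoding to a list of molecular descriptors. The column names and McReynolds constants are copied from Table 1 and the class names follow the abstract; the function names, the ordering of the classes in the one-hot vector, and the example descriptor values are illustrative assumptions rather than the authors' code.

```python
# Polarity classes in the order used for the one-hot vector (ordering is an assumption).
POLARITY_CLASSES = ["strongly polar", "polar", "medium polar", "weakly polar", "non-polar"]

# Column name -> (McReynolds constant, polarity class), copied from Table 1.
COLUMN_INFO = {
    "carbowax-20M": (462, "strongly polar"),
    "DB-225MS": (363, "polar"),
    "DB-624": (158, "medium polar"),
    "OV17": (177, "medium polar"),
    "DB-5": (67, "weakly polar"),
    "HP5-MS": (67, "weakly polar"),
    "HP-1": (44, "non-polar"),
    "OV101": (44, "non-polar"),
}


def one_hot_polarity(polarity):
    """Return a 5-element 0/1 list with a single 1 at the class position."""
    if polarity not in POLARITY_CLASSES:
        raise ValueError(f"unknown polarity class: {polarity!r}")
    return [1 if name == polarity else 0 for name in POLARITY_CLASSES]


def build_features(descriptors, column):
    """Concatenate molecular descriptors with the one-hot polarity of a column."""
    _, polarity = COLUMN_INFO[column]
    return list(descriptors) + one_hot_polarity(polarity)


# Hypothetical descriptor values (e.g. molecular mass and one topological index).
example = build_features([128.17, 3.40], "DB-5")
print(example)  # [128.17, 3.4, 0, 0, 0, 1, 0]
```

Encoding the class as five 0/1 columns, rather than as a single integer from 1 to 5, avoids implying a numeric distance between polarity classes that the model might otherwise exploit.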

Fig. 1 Values of (a) RMSE and (b) R² for different random seeds. RMSE: root mean square error; R²: coefficient of determination.

Fig. 2 Flow chart of 10-fold cross-validation for hyperparameter selection
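The flow in Fig. 2 can be illustrated with a small, self-contained Python sketch: the data are split into 10 folds, each candidate hyperparameter value is scored by its mean validation RMSE, and the value with the lowest score is kept. A one-descriptor k-nearest-neighbor regressor is used here only because it has a single hyperparameter; the data, the candidate values, the shuffling seed, and all function names are assumptions made for illustration and do not reproduce the authors' procedure.

```python
import math
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Predict as the mean target of the n_neighbors closest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_rmse(xs, ys, n_neighbors, n_folds=10):
    """Mean validation RMSE of the k-NN model over the folds."""
    fold_scores = []
    for fold in k_fold_indices(len(xs), n_folds):
        held_out = set(fold)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        errors = [(ys[i] - knn_predict(train_x, train_y, xs[i], n_neighbors)) ** 2
                  for i in fold]
        fold_scores.append(math.sqrt(sum(errors) / len(errors)))
    return sum(fold_scores) / len(fold_scores)


def select_n_neighbors(xs, ys, candidates, n_folds=10):
    """Return the candidate with the lowest cross-validated RMSE."""
    return min(candidates, key=lambda k: cv_rmse(xs, ys, k, n_folds))


# Made-up data: RI grows linearly with a single hypothetical descriptor.
xs = [float(i) for i in range(50)]
ys = [100.0 * x for x in xs]
best = select_n_neighbors(xs, ys, candidates=[1, 3, 25])
print(best)
```

Each sample is held out exactly once, so every candidate is scored on data it was not fitted to. In practice the same loop is usually delegated to a library routine such as scikit-learn's `GridSearchCV` with `cv=10`.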

Fig. 3 One-way analysis of variance of the retention indices (RI) of compounds on columns of different polarity (n=45). Different letters (a and b) indicate a significant difference between groups (p<0.05).
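For readers unfamiliar with the test named in Fig. 3, the following Python sketch computes the one-way ANOVA F statistic, the ratio of between-group to within-group mean squares. The three groups of RI values are made up for illustration, and the function name is an assumption; converting F to a p-value and assigning the letters a and b requires an F distribution and a post-hoc comparison, which this sketch does not cover.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of sample groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up RI values measured on three hypothetical column polarity classes.
groups = [
    [1000.0, 1010.0, 1020.0],
    [1010.0, 1020.0, 1030.0],
    [1040.0, 1050.0, 1060.0],
]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(round(f_stat, 6), df_b, df_w)  # 13.0 2 6
```

A large F means the group means differ by more than the scatter within groups would explain, which is the sense in which RI depends on column polarity.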

Fig. 4 Feature importance ranking based on random forest

Table 2 Hyperparameters of the 10 machine learning models

Regression model	Hyperparameters
LR	-
DT	max_depth=17, random_state=85, min_samples_leaf=1, min_samples_split=4
RF	n_estimators=251, random_state=0, max_depth=17
SVR	kernel='rbf' (radial basis function), C=49
KNN	n_neighbors=7, weights='distance'
GBDT	n_estimators=291, random_state=0
XGBoost	booster='gbtree', n_estimators=166, learning_rate=0.12, max_depth=5, colsample_bytree=0.56, gamma=0.99, reg_alpha=0.57, reg_lambda=0.91, subsample=0.90
AdaBoost	base_estimator=DecisionTreeRegressor, n_estimators=21, random_state=80, learning_rate=0.30
LightGBM	n_estimators=299, learning_rate=0.10, max_depth=5, random_state=80
VR	-

Table 3 Predictive performance of the 10 machine learning models

Regression model	Training (n=2928)			Testing (n=1255)		
	R²	$Q^2_{\mathrm{CV}}$	RMSE	R²	$Q^2_{\mathrm{ext}}$	RMSE
LR	0.93	0.93	163.81±14.55	0.93	0.94	153.55±12.13
DT	0.99	0.96	166.19±22.53	0.956	0.95	186.87±19.72
RF	1.00	0.93	114.31±15.32	0.92	0.91	134.78±13.52
SVR	0.88	0.86	228.68±32.04	0.87	0.77	288.76±36.38
KNN	0.99	0.92	165.20±16.84	0.91	0.90	180.24±20.13
GBDT	0.99	0.96	113.12±17.95	0.96	0.97	108.04±8.87
XGBoost	0.99	0.97	106.03±14.25	0.97	0.97	107.82±14.60
AdaBoost	0.99	0.96	116.03±18.33	0.96	0.94	143.13±18.73
LightGBM	0.99	0.97	104.67±17.19	0.97	0.96	116.94±15.77
VR	0.99	0.97	101.85±17.73	0.97	0.97	107.44±15.63

Fig. 5 Predicted versus experimental retention indices for the voting regression model

Fig. 6 Individual SHAP values for the 24 features

Fig. 7 Williams plot for the voting regression model
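A Williams plot places each compound by its leverage (horizontal axis) and standardized residual (vertical axis); points beyond the warning leverage or outside the residual band are treated as outside the application domain. The Python sketch below computes leverages as h_i = x_i^T (X^T X)^-1 x_i and flags such points. The warning leverage h* = 3(p + 1)/n and the residual limit of ±3 are common conventions assumed here, since the page does not state the thresholds used; the descriptor values, residuals, and function names are made up for illustration.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def leverages(train_x, query_x):
    """Leverage of each query row with respect to the training descriptor matrix."""
    p = len(train_x[0])
    xtx = [[sum(row[a] * row[b] for row in train_x) for b in range(p)]
           for a in range(p)]
    result = []
    for x in query_x:
        z = solve_linear_system(xtx, x)  # z = (X^T X)^-1 x
        result.append(sum(xi * zi for xi, zi in zip(x, z)))
    return result


def outside_domain(train_x, query_x, std_residuals, residual_limit=3.0):
    """Indices of query points beyond the warning leverage or the residual limit."""
    n, p = len(train_x), len(train_x[0])
    warning_leverage = 3.0 * (p + 1) / n  # assumed convention h* = 3(p + 1)/n
    h = leverages(train_x, query_x)
    return [i for i, (hi, ri) in enumerate(zip(h, std_residuals))
            if hi > warning_leverage or abs(ri) > residual_limit]


# Made-up single-descriptor data: the last compound lies far from the others.
train = [[float(v)] for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]]
residuals = [0.1, -0.4, 0.3, 3.5, -0.2, 0.0, 0.6, -1.1, 0.2, 0.5]
print(outside_domain(train, train, residuals))  # [3, 9]
```

Index 3 is flagged for its large standardized residual and index 9 for its high leverage, the two ways a point can fall outside the domain. The solver assumes a non-singular X^T X and would need a guard for collinear descriptors.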

Table 4 Comparison of the ensemble learning prediction model with previous models

Order	Method	Stationary phases	Number	Training			Testing			Ref.
				R²	RMSE	MAE	R²	RMSE	MAE	
1	VR	stationary phases of five polarity classes	4183	0.99	101.85	23.60	0.97	107.44	75.28	this study
2	GNN	SSNP	29518	-	-	11.80	0.99	-	30.92	[11]
		SNP	14033	-	-	23.33	0.99	-	42.41	
		SP	7052	-	-	45.46	0.95	-	84.34	
3	GNN	non-polar	94183	-	20.69	-	-	57.90	-	[12]
4	PLR	non-polar	90	-	-	-	0.99	17.40	-	[10]
5	-	strongly polar	1179	0.83	170.90	124.20	0.90	132.90	102.70	[8]
References

[1]	Stoffel R, Quilliam M, Hardt N, et al. Anal Bioanal Chem, 2021, 414(25): 7387
[2]	Bizzo H, Brilhante N, Nolvachai Y, et al. J Chromatogr A, 2023, 1708: 464376
[3]	Anjum A, Liigand J, Milford R, et al. J Chromatogr A, 2023, 1705: 464176
[4]	Qu C, Schneider B, Kearsley A, et al. J Chromatogr A, 2021, 1646: 462100
[5]	Randazzo G M, Bileck A, Danani A, et al. J Chromatogr A, 2020, 1612: 460661
[6]	Matyushin D D, Buryak A. IEEE Access, 2020, 8: 223140
[7]	Acimovic M, Pezo L, Tesevic V, et al. Ind Crops Prod, 2020, 154: 112752
[8]	Ahmadi S, Lotfi S, Hamzehali H, et al. RSC Adv, 2024, 14(5): 3186
[9]	Sun L K, Zhang M, Xie L, et al. Chem Biol Drug Des, 2023, 101(2): 380
[10]	Kumar A, Kumar P, Singh D. Chemom Intell Lab Syst, 2022, 224: 104552
[11]	Matyushin D, Sholokhova A, Buryak A. J Chromatogr A, 2019, 1607: 460395
[12]	Szucs R, Brown R, Brunelli C, et al. J Chromatogr A, 2023, 1707: 464317
[13]	McReynolds W. J Chromatogr Sci, 1970, 8: 685
[14]	Yan J, Cao D, Guo F, et al. J Chromatogr A, 2012, 1233: 118
[15]	Babushok V I, Linstrom P, Zenkevich I. J Phys Chem Ref Data, 2011, 40(4): 043101
[16]	Babushok V, Zenkevich I. Chromatographia, 2008, 69(3): 257
[17]	Yan A, Jiao G, Hu Z, et al. Comput Chem, 2000, 24(2): 171
[18]	Rojas C, Duchowicz P, Tripaldi P, et al. J Chromatogr A, 2015, 1422: 277
[19]	Yan J, Liu X, Zhu W, et al. Chromatographia, 2015, 78(1): 89
[20]	Dossin E, Martin E, Diana P, et al. Anal Chem, 2016, 88(15): 7539
[21]	Copyright © Shimadzu (China) Co.: GC-MS/MS Database. [2024-01-13]. https://support.shimadzu.com.cn/an/library/index.html
[22]	Farkas O, Héberger K, Zenkevich I. Chemom Intell Lab Syst, 2004, 72(2): 173
[23]	Gurvich V, Naumova M. Symmetry-Basel, 2021, 13(8): 1387
[24]	Todeschini R, Consonni V. Molecular Descriptors for Chemoinformatics. 2nd ed. Weinheim: WILEY-VCH Verlag GmbH & Co. KGaA, 2009
[25]	Platts J, Butina D, Abraham M, et al. J Chem Inf Model, 1999, 39(5): 835