Chinese Journal of Chromatography ›› 2025, Vol. 43 ›› Issue (4): 355-362.DOI: 10.3724/SP.J.1123.2024.07014

• Articles •

Construction of a machine learning ensemble prediction model for gas chromatographic retention index on stationary phases with different polarities

WANG Qianyi, ZHU Yongle, LI Xuehua*

  1. Key Laboratory of Industrial Ecology and Environmental Engineering, Ministry of Education, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
  • Received: 2024-07-21 Online: 2025-04-08 Published: 2025-03-26

Abstract:

Gas chromatography is an analytical technique widely used to separate and identify compounds. The retention index (RI) plays a significant role in gas chromatography because it provides a standardized measure of a compound's retention behavior under specific conditions and is a powerful compound-identification tool, particularly for complex mixtures. Predicting RI values is therefore a meaningful objective, especially across stationary phases of different polarities, because RI values vary significantly with stationary-phase polarity. To address this issue, we developed a model for predicting gas-chromatographic RIs on stationary phases of varying polarity by collecting 4183 retention-index data points for 2499 compounds on eight types of stationary phase from the literature and databases. The stationary phases were further classified into five categories based on their McReynolds constants: strongly polar, polar, medium polar, weakly polar, and non-polar. This classification ensures that the model can handle a wide range of polarities, thereby enhancing its versatility and applicability to various analytical scenarios. The predictive model was constructed by combining two types of feature. First, the 1D and 2D molecular-structural features of the compounds were determined; these features capture the chemical and physical properties of the compounds, including their relative molecular masses, functional groups, and topological indices, and describe the molecular characteristics that influence retention behavior. Second, stationary-phase polarity was one-hot encoded, which converts categorical stationary-phase-polarity information into a format that can be used effectively by machine-learning algorithms.
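As an illustration of the feature construction described above, the following minimal sketch one-hot encodes the five polarity classes and appends the code to a descriptor vector. The function names, the class ordering, and the descriptor values are assumptions made for this example and are not taken from the article.

```python
# Polarity classes named in the abstract; the ordering is an arbitrary choice.
POLARITY_CLASSES = [
    "strongly polar", "polar", "medium polar", "weakly polar", "non-polar",
]


def one_hot_polarity(label):
    """Return a 5-element 0/1 list with a single 1 at the class position."""
    if label not in POLARITY_CLASSES:
        raise ValueError(f"unknown polarity class: {label!r}")
    return [1 if label == cls else 0 for cls in POLARITY_CLASSES]


def build_feature_vector(descriptors, polarity_label):
    """Concatenate molecular descriptors with the one-hot polarity code."""
    return list(descriptors) + one_hot_polarity(polarity_label)


# Hypothetical descriptor values (a molecular mass and a topological index).
descriptors = [128.17, 3.2]

# The same compound on two stationary phases yields two distinct input rows.
row_nonpolar = build_feature_vector(descriptors, "non-polar")
row_polar = build_feature_vector(descriptors, "polar")
print(row_nonpolar)
print(row_polar)
```

Because the descriptor part of each row is identical, any difference in the predicted RI between the two rows can come only from the polarity code.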
One-hot encoding ensures that the model can distinguish among the effects of different polarities on the retention behavior of the compounds. Nine algorithms were used to construct predictive machine-learning models, including linear regression, decision tree, random forest, support vector machine (SVM), k-nearest-neighbor (KNN), gradient-boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient-boosting machine (LightGBM) algorithms. Voting regression was then used to build the best-performing ensemble learning model from the XGBoost and LightGBM algorithms. This ensemble model, which combines the strengths of its individual base models, performed well, with a training-set coefficient of determination (R²) of 0.99, a training-set root mean square error (RMSE) of 101.85, a test-set R² of 0.97, and a test-set RMSE of 107.44. Williams plots were used to characterize the applicability domain of the model; over 94% of the data lie within the domain, which indicates broad applicability and high predictive confidence. The successful development of this predictive retention-index model represents a significant advance in the gas-chromatography field. By integrating advanced machine-learning techniques with comprehensive chemical- and physical-property data, the model predicts RI values with high accuracy across stationary phases with a wide range of polarities. The ensemble model is also more robust and has better predictive ability than the individual machine-learning models. The model is of great scientific significance and practical value for improving the efficiency and accuracy of target and non-target gas-chromatographic analyses.
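The voting step and the two reported metrics can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation. The two base models are hypothetical stand-in functions in place of trained XGBoost and LightGBM regressors. The vote is assumed to be unweighted. The data are invented n-alkane-like points, where RI is 100 times the carbon number by definition.

```python
import math


def voting_predict(models, x):
    """Unweighted voting regression: the mean of the base-model predictions."""
    preds = [model(x) for model in models]
    return sum(preds) / len(preds)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the RI values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Hypothetical stand-ins for two trained base regressors (assumption):
# one overestimates by 30 RI units, the other underestimates by 10.
def model_a(carbon_number):
    return 100.0 * carbon_number + 30.0


def model_b(carbon_number):
    return 100.0 * carbon_number - 10.0


carbon_numbers = [8.0, 10.0, 12.0]
y_true = [800.0, 1000.0, 1200.0]
y_pred = [voting_predict([model_a, model_b], c) for c in carbon_numbers]

print("R2:", r2_score(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

In this toy case the averaged prediction is off by 10 RI units, less than the 30-unit error of the worse base model, because the two models err in opposite directions. Averaging reduces error in this way only when the base-model errors partly cancel.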

Key words: gas-chromatographic retention index, ensemble learning, different polar stationary phases, McReynolds’ constant
