Chinese Journal of Chromatography ›› 2021, Vol. 39 ›› Issue (3): 211-218. DOI: 10.3724/SP.J.1123.2020.08015

• Reviews •

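Research progress and application of retention time prediction methods based on deep learning

DU Zhuokun1,2, SHAO Wei1, QIN Weijie1,2,*

  1. School of Basic Medicine, Anhui Medical University, Hefei, Anhui 230032
  2. State Key Laboratory of Proteomics, Beijing Proteome Research Center, Institute of Lifeomics, Academy of Military Medical Sciences, Academy of Military Sciences, Beijing 102206
  • Received: 2020-08-20 Online: 2021-03-08 Published: 2021-02-03
  • Corresponding author: QIN Weijie
  • About the author: *Tel: (010)61777111, E-mail: aunp_dna@126.com.
  • Supported by:
    National Key Research and Development Program of China (2017YFA0505002); National Key Research and Development Program of China (2018YFC0910302); National Key Research and Development Program of China (2016YFA0501403)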

Research progress and application of retention time prediction method based on deep learning

DU Zhuokun1,2, SHAO Wei1, QIN Weijie1,2,*

  1. School of Basic Medicine, Anhui Medical University, Hefei 230032, China
  2. State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing 102206, China
  • Received: 2020-08-20 Online: 2021-03-08 Published: 2021-02-03
  • Contact: QIN Weijie
  • Supported by:
    National Key Research and Development Program of China (2017YFA0505002); National Key Research and Development Program of China (2018YFC0910302); National Key Research and Development Program of China (2016YFA0501403)

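Abstract (from the Chinese-language version):

In proteomics research based on liquid chromatography-mass spectrometry, the retention time of a peptide is a characteristic parameter that effectively distinguishes different peptides, and it can be predicted from information such as the peptide's own sequence. Using predicted retention times to assist the identification of peptide sequences from mass spectrometry data can improve identification accuracy, so retention time prediction has long attracted wide attention in the field. Traditional retention time prediction methods usually calculate the physicochemical properties of a peptide from its amino acid sequence and then calculate its retention time under specific chromatographic conditions. In recent years, deep learning methods have made great progress and play an increasingly important role in proteomics research. A variety of deep learning-based retention time prediction methods have now been developed; compared with traditional methods, they are more accurate, are easy to use across platforms, and can predict the retention times of modified peptides. However, predictions for certain complex modifications, such as glycosylation, are still not accurate enough. How to further improve prediction accuracy for modified peptides is an important research direction for deep learning-based retention time prediction. The predicted retention times are applied to quality control and method evaluation in peptide identification, and are combined with predicted tandem mass (MS/MS) spectra to build simulated spectral libraries. This paper reviews the latest research progress and applications of deep learning methods in retention time prediction and offers an outlook on development trends and future application directions, with the aim of providing a reference for retention time prediction research and proteome identification.

Key words: liquid chromatography-tandem mass spectrometry, retention time, deep learning, proteome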

Abstract:

In the “shotgun” proteomics strategy, the proteome is characterized by analyzing tryptic peptides using liquid chromatography-mass spectrometry. In this strategy, the retention time of a peptide during liquid chromatography separation can be predicted from its sequence, which makes it a useful feature for peptide identification. The prediction of retention time has therefore attracted considerable research attention. Traditional methods calculate the physicochemical properties of peptides from their amino acid sequences to obtain the retention time under specific chromatography conditions; however, these methods cannot be directly applied to other chromatography conditions, nor can they be used across laboratories or instrument platforms. To address this problem, deep learning has been introduced into proteomics research for retention time prediction in recent years. Deep learning is a machine-learning approach with a strong capacity to learn complex relationships from large-scale data. By stacking multiple hidden layers, deep neural networks can ingest raw data without manually designed features. Transfer learning is an important technique in deep learning: it improves the learning of a new task by transferring knowledge from a related task that has already been learned. It allows models trained on large datasets to be reused across conditions by fine-tuning on smaller datasets instead of retraining the whole model. Many deep learning-based retention time prediction methods have been developed. During model training, peptide sequences are encoded numerically to represent peptide information. Deep learning then learns the relationship between the characteristics of the peptides and their corresponding retention times without manual input of their physical and chemical properties.
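As a rough illustration of the sequence-encoding step mentioned above, the sketch below one-hot encodes a peptide into a fixed-size matrix that a neural network could take as input. The alphabet order, the zero-padding scheme, the default maximum length, and the function name `one_hot_encode` are assumptions made for this example; they are not taken from any specific published tool, and real models may use learned embeddings or handle modified residues differently.

```python
from typing import List

# Assumption: only the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide: str, max_length: int = 30) -> List[List[int]]:
    """Return a max_length x 20 matrix; rows past the peptide end stay all-zero."""
    if len(peptide) > max_length:
        raise ValueError(f"peptide longer than max_length={max_length}")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(peptide):
        if residue not in AA_INDEX:
            raise ValueError(f"unsupported residue: {residue!r}")
        matrix[position][AA_INDEX[residue]] = 1
    return matrix


# Hypothetical example peptide, padded to 10 positions.
encoded = one_hot_encode("PEPTIDEK", max_length=10)
print(len(encoded), len(encoded[0]))  # 10 20
print(sum(map(sum, encoded)))         # 8 (one entry set per residue)
```

Padding to a fixed length keeps every input the same shape, which is one simple way to feed variable-length peptides to a network; it is a design choice for this sketch rather than a requirement.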
Compared with traditional methods, deep learning methods are more accurate and can be readily applied to different chromatography conditions through transfer learning. If there are not enough data to train a new model, a model trained on other datasets can be used instead after calibration with a small dataset acquired under the target chromatography conditions. Although the retention times of modified peptides can also be predicted, the predictions are inadequate for complex modifications such as glycosylation, and this remains one of the main problems to be solved. Predicted retention times are used for quality control of peptide identification. When prediction accuracy is high, predicted retention times can be treated as a proxy for actual retention times. The difference between predicted and observed retention times can therefore serve as an effective and unbiased quantitative metric for evaluating the quality of peptide-spectrum matches (PSMs) reported by different peptide identification methods. Combined with fragment ion intensity prediction, retention time prediction is also used to generate spectral libraries for data-independent acquisition (DIA)-based mass spectrometry analysis. DIA methods generally identify peptides using specific spectral libraries obtained from data-dependent acquisition (DDA) experiments. As a result, only peptides detected in the DDA experiments are present in the libraries and can be detected in DIA. Furthermore, building libraries from DDA experiments takes considerable time and effort, and such libraries typically cannot be transferred across laboratories or instrument platforms. In contrast, pseudo spectral libraries generated from predicted retention times and fragment ion intensities can overcome these shortcomings, because they provide theoretical spectra for all possible peptides without the need for DDA experiments.
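The calibration and quality-control ideas described above can be sketched in a few lines. The example below is a deliberately simplified stand-in: it fits a straight line that maps predicted retention times onto the local chromatography using a few reference peptides, then uses the absolute difference between calibrated prediction and observation to flag suspicious PSMs. A linear fit is an assumption made here for brevity and is not the fine-tuning of network weights that transfer learning involves. All numbers, the 2.0-minute tolerance, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_linear_calibration(predicted: Sequence[float],
                           observed: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept mapping predicted RT onto observed RT."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need at least two paired values")
    mean_p, mean_o = mean(predicted), mean(observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_p == 0:
        raise ValueError("predicted values must not all be identical")
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    slope = cov / var_p
    return slope, mean_o - slope * mean_p


def delta_rt(predicted: Sequence[float], observed: Sequence[float],
             slope: float, intercept: float) -> List[float]:
    """Absolute difference between calibrated predicted RT and observed RT."""
    return [abs(o - (slope * p + intercept)) for p, o in zip(predicted, observed)]


# Hypothetical calibration peptides (minutes): model output vs. local observation.
slope, intercept = fit_linear_calibration([10.0, 20.0, 30.0, 40.0],
                                          [12.0, 22.0, 32.0, 42.0])

# Hypothetical PSMs to evaluate; the tolerance is an arbitrary choice.
deltas = delta_rt([15.0, 25.0, 35.0], [17.0, 27.5, 50.0], slope, intercept)
flagged = [d > 2.0 for d in deltas]
print(deltas)   # [0.0, 0.5, 13.0]
print(flagged)  # [False, False, True]
```

In this toy data the third PSM elutes 13 minutes away from its calibrated prediction, so it would be the one to inspect first.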
This paper reviews the research progress of deep learning methods in retention time prediction and related applications, in order to provide a reference for retention time prediction and protein identification. The development directions and application trends of deep learning-based retention time prediction methods are also discussed.

Key words: liquid chromatography-tandem mass spectrometry (LC-MS/MS), retention time, deep learning, proteomics

CLC number: