Performance Evaluation and Online Realization of Data-driven Normalization Methods Used in LC/MS based Untargeted Metabolomics Analysis

Nature


In untargeted metabolomics analysis, several factors (e.g., unwanted experimental & biological variations and technical errors) may hamper the identification of differential metabolic


features, so data-driven normalization is required before feature selection. So far, at least 16 normalization methods have been widely applied for processing the LC/MS based


metabolomics data. However, the performance and the sample size dependence of those methods have not yet been exhaustively compared and no online tool for comparatively and comprehensively


evaluating the performance of all 16 normalization methods has been provided. In this study, a comprehensive comparison of these methods was conducted. As a result, the 16 methods were


categorized into three groups based on their normalization performances across various sample sizes. The VSN, the Log Transformation and the PQN were identified as methods of the best


normalization performance, while the Contrast consistently underperformed across all sub-datasets of different benchmark data. Moreover, an interactive web tool comprehensively evaluating


the performance of 16 methods specifically for normalizing LC/MS based metabolomics data was constructed and hosted at http://server.idrb.cqu.edu.cn/MetaPre/. In summary, this study could


serve as a useful guidance to the selection of suitable normalization methods in analyzing the LC/MS based metabolomics data.


Metabolomics aims at characterizing metabolic biomarkers by analytically describing complex biological samples1. At present, the metabolomics based on liquid chromatography mass spectrometry


(LC/MS) is capable of simultaneously monitoring thousands of metabolites in bio-fluid, cell and tissue, and is widely applied to various aspects of biomedical research. In particular,


metabolomics analysis on LC/MS data can aid the choice of therapy2, provide powerful tools for drug discovery by revealing drug mechanisms of action and potential side effects3, and help to


identify biomarkers4,5,6 of various diseases such as hepatocellular carcinoma (HCC)7, colorectal cancer8, insulin resistance9, and so on.


Several factors (e.g., unwanted experimental & biological variations and technical errors) may hamper the identification of differential metabolic profiles and effectiveness of metabolomics


analysis (e.g., paired or nested studies)10,11,12,13,14. To remove specific types of unwanted variations, the signal drift correction (when quality control samples are available), the batch


effect removal (when internal standards or quality control samples are available), and the scaling (not suitable when the self-averaging property does not hold) are adopted13. These commonly


used strategies are generally grouped into two categories: (1) method-driven normalization approaches extrapolating external model that is based upon internal standards or quality control


samples and (2) data-driven normalization approaches scaling or transforming metabolomics data15,16,17,18,19,20. As reported in Ejigu’s work, the method-driven strategies may not be


practical due to several reasons, especially their unsuitability for treating untargeted metabolomics data, while data-driven ones are better choices for untargeted LC/MS based metabolomics


data15. The capacities of 11 data-driven normalization methods (“normalization method” in short for the rest of this paper) for processing nuclear magnetic resonance (NMR) based metabolomics


data were systematically compared21. Two methods (the Quantile and the Cubic Splines) were identified as the “best” performing normalization methods, while two other methods (the Contrast


and the Li-Wong) could “hardly” reduce bias at all and could not improve the comparability between samples21. For gas chromatography mass spectrometry (GC/MS) based metabolomics, a


comparative research on the performances of 8 normalization methods discovered two (the Auto Scaling and the Range Scaling) of “overall best performance”12. Similar to NMR and GC/MS, the


LC/MS is one of the most popular sources of current metabolomics data, and it is of great importance to analyze the differential influence of those methods on LC/MS based data. Ejigu et al.


measured the performance of 6 methods according to their “average metabolite specific coefficient of variation (CV)”15. The CV showed that the Cyclic Loess and the Cubic Splines performed


“slightly better” than other methods, but no statistical difference among CVs of those methods was observed15.
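
As an illustration of the kind of criterion described above, the following is a minimal sketch of an average feature-wise coefficient of variation. It assumes a matrix with samples as rows and metabolic features as columns; the function name `average_feature_cv` is hypothetical, and the exact definition used in the cited work may differ.

```python
import statistics


def average_feature_cv(samples):
    """Average coefficient of variation (SD / |mean|) over all features.

    `samples` is a list of rows, one row per sample and one column per
    metabolic feature. Features with a zero mean are skipped because the
    CV is undefined for them.
    """
    cvs = []
    for column in zip(*samples):
        mean = statistics.fmean(column)
        if mean != 0:
            cvs.append(statistics.stdev(column) / abs(mean))
    if not cvs:
        raise ValueError("no feature with a non-zero mean")
    return statistics.fmean(cvs)
```

A lower average CV after normalization would suggest less unwanted spread across replicate measurements, which is how such a criterion is typically read.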


For the past decade, no less than 16 methods have been developed for normalizing the LC/MS based metabolomics data13,22,23, some of which (e.g., the VSN24, the Quantile25, the Cyclic


Loess26) are directly adopted from those previously used for processing transcriptomics data. Both metabolomics data and transcriptomics data are high-dimensional. However, the dimension of


transcriptomics data can reach 10 thousand, while that of metabolomics data is only a few thousand. Moreover, unlike transcriptomics, correlation among metabolites identified from


metabolomics data may not indicate a common biological function27. Apart from the above differences, there are significant similarities between the two types of OMICs data: (1) right-skewed


distribution23, (2) great data sparsity28, (3) substantial amount of noise29,30 and (4) significantly varied sample sizes31,32. Due to these similarities, it is feasible to apply some of the


normalization methods used in transcriptomics data analysis to the metabolomics one.


Those 16 methods specifically normalizing LC/MS based metabolomics data can be classified into two groups21. Methods in group one (including the Contrast Normalization33, the Cubic


Splines34, the Cyclic Loess35, the Linear Baseline Scaling25, the MSTUS22, the Non-Linear Baseline Normalization36, the Probabilistic Quotient Normalization37 and the Quantile


Normalization25) aim at removing the unwanted sample-to-sample variations, while methods of the second group (including the Auto Scaling38, the Level Scaling12, the Log Transformation39, the


Pareto Scaling40, the Power Scaling41, the Range Scaling42, the VSN43,44 and the Vast Scaling45) adjust biases among various metabolites to reduce heteroscedasticity. However, the


performance and the sample size dependence of those methods widely adopted in current metabolomics studies (e.g., the Pareto Scaling and the VSN)28,46 have not yet been exhaustively compared


in the context of LC/MS metabolomics data analysis.
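
To make the two groups concrete, the sketch below shows one method from each: Probabilistic Quotient Normalization (sample-to-sample) and Pareto Scaling (feature-wise). It is a simplified illustration, not the implementation used in this study. It assumes samples as rows and features as columns, uses the feature-wise median across all samples as the PQN reference, and the function names are hypothetical.

```python
import math
import statistics


def pqn_normalize(samples):
    """Probabilistic Quotient Normalization with a median reference profile."""
    n_features = len(samples[0])
    # Reference profile: median intensity of each feature across samples.
    reference = [statistics.median(row[j] for row in samples)
                 for j in range(n_features)]
    normalized = []
    for row in samples:
        # Quotient of each feature against the reference (skip zero reference).
        quotients = [x / r for x, r in zip(row, reference) if r > 0]
        # The median quotient estimates the sample's overall dilution factor.
        dilution = statistics.median(quotients)
        normalized.append([x / dilution for x in row])
    return normalized


def pareto_scale(samples):
    """Mean-center each feature and divide by the square root of its SD."""
    columns = list(zip(*samples))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(x - m) / math.sqrt(s) if s > 0 else 0.0
             for x, m, s in zip(row, means, sds)]
            for row in samples]
```

For example, three samples that differ only by a dilution factor, such as `[[1, 2, 4], [2, 4, 8], [4, 8, 16]]`, are mapped to identical rows by `pqn_normalize`, whereas `pareto_scale` leaves sample-wise dilution in place and instead reduces the dominance of high-variance features.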


Moreover, several comprehensive metabolomics pipelines are currently available online, where various normalization algorithms are integrated as one step in their corresponding analysis


chain. These online pipelines include the MetaboAnalyst28, the Metabolomics Workbench47, the MetaDB48, the MetDAT49, the MSPrep50, the Workflow4Metabolomics51 and the XCMS online52. Based on


a comprehensive review, the number of normalization algorithms provided by the above pipelines varies significantly from 2 (the Workflow4Metabolomics) to 13 (the MetaboAnalyst). 6 out of


those 7 pipelines only provide 100 samples selected by manual literature and dataset reviews. Based on the above criteria, 4 benchmark datasets were collected for analysis, which include the


positive (ESI+) and negative (ESI−) ionization modes of both MTBLS2854 and MTBLS1755. For MTBLS17, only the dataset of experiment 1 with >100 studied samples was included. For the remaining


text of this paper, MTBLS17 was used to stand for the dataset of experiment 1 in Ressom’s work55. Both ESI+ and ESI− of MTBLS28 provided LC/MS based metabolomics profiles of 1,005 samples


(469 lung cancer patients and 536 healthy individuals)54, and MTBLS17 ESI+ and ESI− gave profiles of 189 samples (60 HCC patients and 129 people with cirrhosis) and 185 samples (59 HCC


patients and 126 people with cirrhosis), respectively55.


To construct training and validation datasets and sub-datasets of various sample size, random sampling and k-means clustering were applied. Taking MTBLS28 ESI+ as an example, 1,005 samples


were divided into a training dataset (400 lung cancer patients and 500 healthy individuals) and a validation dataset (105 samples) by random sampling. Moreover, to generate the sub-datasets from


the training dataset, the k-means clustering56 was used to sample 10 sub-datasets of various sample sizes. In particular, the numbers of healthy individuals versus lung cancer patients


were 50 vs. 40, 100 vs. 80, 150 vs. 120, 200 vs. 160, 250 vs. 200, 300 vs. 240, 350 vs. 280, 400 vs. 320, 450 vs. 360, and 500 vs. 400 for 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%


of the samples in the training group, respectively.
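
The sub-dataset sizes above follow from taking the same percentage of each class in the training dataset. The sketch below illustrates this with plain stratified random sampling, which is swapped in here for simplicity and is not the k-means based sampling used in this study. The function name, the label strings and the seed are assumptions.

```python
import random


def stratified_subsets(labels, percents, seed=0):
    """Draw one class-balanced subset of sample indices per percentage.

    `labels` holds one class label per training sample. For each value in
    `percents`, the same percentage of every class is sampled at random.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    subsets = {}
    for pct in percents:
        chosen = []
        for indices in by_class.values():
            # Integer arithmetic avoids floating-point rounding surprises.
            chosen.extend(rng.sample(indices, len(indices) * pct // 100))
        subsets[pct] = sorted(chosen)
    return subsets


# Training dataset of the MTBLS28 ESI+ example: 500 healthy, 400 lung cancer.
labels = ["healthy"] * 500 + ["lung_cancer"] * 400
subsets = stratified_subsets(labels, range(10, 101, 10))
```

With these labels, the 10% subset holds 90 samples and the 100% subset holds all 900, matching the class ratios listed in the text.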


Biological variance and technical error are two key factors introducing biases to the metabolomics data. Biological variance arises from the spread of metabolic signals detected from various


biological samples57, while technical error results from machine drift58. In particular, biological variances (e.g., varying concentration levels of bio-fluid, different cell sizes, varying


sample measurements) are commonly encountered in metabolomics data13, while technical errors (e.g., a sudden drop in peak intensities or measurements on different instruments) are the major


issues in large-scale metabolomics studies58. Apart from those above methods widely adopted to remove biological variances22, quality-control (QC) samples were used to significantly reduce


technical errors58.


Moreover, metabolomics data are inherently sparse, with a substantial proportion of missing values (10~40%) that can affect up to 80% of all metabolic features59.


The direct assignment of zero to the missing values could be useful for cluster analysis, but it may lead to poor performance or even malfunction if a normalization method is applied50,


especially for those methods based on the logarithm (e.g., the Log Transformation)50,53. Several missing value imputation methods are currently available, among which the KNN algorithm60 was


reported as the most robust one for analyzing mass spectrometry based metabolomics data60. Therefore, the KNN algorithm was adopted in this work to impute the missing signals of the


metabolic features.
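
The idea behind KNN imputation can be sketched as follows. This is a simplified illustration and not the implementation used in this study: it assumes features as rows and samples as columns, marks missing values with `None`, searches for neighbours among features using a root-mean-square distance over co-observed samples, and uses a hypothetical function name and default `k`.

```python
import math


def knn_impute(matrix, k=3):
    """Fill missing cells with the mean of the k most similar features.

    `matrix` is a list of rows (features) over columns (samples), with
    None for missing values. The input is not modified. A cell stays
    None if no other feature has a value in that sample.
    """
    def distance(a, b):
        pairs = [(x, y) for x, y in zip(a, b)
                 if x is not None and y is not None]
        if not pairs:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other features observed in sample j.
            candidates = [(distance(row, other), other[j])
                          for m, other in enumerate(matrix)
                          if m != i and other[j] is not None]
            candidates = [c for c in candidates if c[0] < math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed
```

Unlike zero filling, the imputed intensities stay on the scale of neighbouring features, so a subsequent logarithm-based method does not encounter zeros introduced by the imputation step itself.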


In this study, a widely adopted data pre-processing procedure54,60,61 was applied, which included sample filtering, data matrix construction and signal filtering & imputing (Fig. 1). In


particular, (1) samples with signal interruption or undetectable internal standards were removed based on Mathé’s work54; (2) peak detection, retention time correction and peak alignment54


were applied to the UHPLC/Q-TOF-MS raw data (in CDF format) using the xcmsSet, the group and the retcor functions in the XCMS package62 with both the full width at half-maximum (fwhm) and


the retention time window (bw) set as 10; (3) metabolic features detected in 1) of the partial least squares discriminant analysis (PLS-DA)84 in R package ropls85 together with p-value (1)


of PLS-DA model. Then, SVM models were constructed based on these identified differential features. After k-fold cross validation, the ROC curve together with its AUC value was calculated and


displayed on the web page.
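
The evaluation step described above (k-fold cross validation followed by an AUC value) can be sketched in a few lines. The classifier itself is left out; the sketch only shows how folds may be formed and how an AUC can be computed from classifier scores through the pairwise-ranking definition. Both function names are hypothetical, and the fold assignment shown (interleaved indices) is an assumption rather than the scheme used by MetaPre.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = case, 0 = control).

    Computed as the fraction of (case, control) pairs in which the case
    receives the higher score, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a full workflow, a model would be fitted on each `train` split, scores would be collected on the matching `test` split, and `roc_auc` would be applied to the pooled or per-fold scores.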


MetaPre is a valuable online tool for selecting suitable methods for normalizing LC/MS based metabolomics data, and is a useful complement to the currently available tools in modern metabolomics


analysis.


Based on the 4 datasets tested in this work, 16 methods for normalizing LC/MS based metabolomics data were categorized into three groups according to their normalization performances across


various sample sizes, which included the superior (3 methods), good (12 methods) and poor (1 method) performance groups. The VSN, the Log Transformation and the PQN were identified as


methods of the best normalization performance, while the Contrast consistently underperformed across all sub-datasets of different benchmark data among those 16 methods. Moreover, an


interactive web tool comprehensively evaluating the performance of all 16 methods for normalizing LC/MS based metabolomics data was constructed and hosted at


http://server.idrb.cqu.edu.cn/MetaPre/. In sum, this study could serve as guidance to the selection of suitable normalization methods in analyzing the LC/MS based metabolomics data.


How to cite this article: Li, B. et al. Performance Evaluation and Online Realization of Data-driven Normalization Methods Used in LC/MS based Untargeted Metabolomics Analysis. Sci. Rep. 6,


38881; doi: 10.1038/srep38881 (2016).


Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


This work was funded by the research support of National Natural Science Foundation of China (81202459, 21505009 and 21302102); by Innovation Project on Industrial Generic Key Technologies


of Chongqing (cstc2015zdcy-ztzx120003); by the Chongqing Graduate Student Research Innovation Project (CYB14027); by the Fundamental Research Funds for the Central Universities


(CDJZR14468801, CDJKXB14011, 2015CDJXY).


Li Bo, Tang Jing and Yang Qingxia contributed equally to this work.


Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China


Bo Li, Jing Tang, Qingxia Yang, Xuejiao Cui, Shuang Li, Quanxing Cao, Weiwei Xue, Na Chen & Feng Zhu


College of Mathematics and Statistics, Chongqing University, Chongqing, 401331, China


F.Z. designed research. B.L., J.T., Q.Y., X.C. and S.C. performed research and developed the web tool. B.L., W.X., N.C., S.L. and Q.C. wrote the scripts and prepared the example data. B.L.


and F.Z. wrote the manuscript. All authors reviewed the manuscript.


This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons


license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to


reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

