
Metabolomic analysis of gastric cancer progression within the Correa’s cascade using ultraperformance liquid chromatography - mass spectrometry Julia Kuligowski, Daniel Sanjuan-Herráez, Maria A Vázquez-Sánchez, Anna Brunet-Vega, Carles Pericay, Maria Jose Ramírez-Lázaro, Sergio Lario, Lourdes Gombau, Félix Junquera, Xavier Calvet, and Guillermo Quintas J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00281 • Publication Date (Web): 07 Jul 2016 Downloaded from http://pubs.acs.org on July 7, 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Metabolomic analysis of gastric cancer progression within the Correa’s cascade using ultraperformance liquid chromatography - mass spectrometry

Julia Kuligowski a,**, Daniel Sanjuan-Herráez b,**, María A. Vázquez-Sánchez b, Anna Brunet-Vega c,d, Carles Pericay c, Maria José Ramírez-Lázaro e,f, Sergio Lario d,e,f, Lourdes Gombau b, Félix Junquera e,f, Xavier Calvet e,f, Guillermo Quintás b,g

a Neonatal Research Unit, Health Research Institute La Fe, Valencia, Spain
b Safety and Sustainability Division, Leitat Technological Center, Barcelona, Spain
c Oncology Service, Hospital de Sabadell, Institut Universitari Parc Taulí-UAB, Sabadell, Spain
d Fundació Parc Tauli, Institut Universitari Parc Taulí-UAB, Sabadell, Spain
e Digestive Diseases Service, Hospital de Sabadell, Institut Universitari Parc Taulí-UAB, Sabadell, Spain
f Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Madrid, Spain
g Analytical Unit, Health Research Institute La Fe, Valencia, Spain

**Both authors contributed equally to this work

Author information
*Corresponding author: [email protected], Phone: (+34) 93 788 23 00 (ext. 3502)

ABSTRACT

Gastric cancer (GC) is among the most common cancers worldwide. Gastric carcinogenesis is a multistep and multifactorial process beginning with chronic gastritis induced by Helicobacter pylori (H. pylori) infection. This process is often described via a sequence of events known as Correa’s cascade, a stepwise progression from non-active gastritis through chronic active gastritis and precursor lesions of gastric cancer (atrophy, intestinal metaplasia, dysplasia) to adenocarcinoma. Our aim was to identify a plasma metabolic pattern characteristic of GC through disease progression within Correa’s cascade. This study involved the analysis of plasma samples collected from 143 patients classified in four groups: patients with non-active gastritis and no H. pylori infection, H. pylori-infected patients with chronic active gastritis, infected or non-infected patients with precursor lesions of gastric cancer, and patients with GC. Independent binary partial least squares-discriminant analysis models of UPLC-ESI(+)-TOFMS metabolic profiles, implemented in a decision directed acyclic graph, allowed the identification of tryptophan and kynurenine as discriminant metabolites. This finding could be attributed to indoleamine-2,3-dioxygenase up-regulation in cancer patients, leading to tryptophan depletion and the generation of kynurenine metabolites. Furthermore, phenylacetylglutamine was also classified as a discriminant metabolite. Our data suggest the use of tryptophan, kynurenine and phenylacetylglutamine as potential GC biomarkers.

Keywords: Gastric cancer, metabolomics, Helicobacter pylori, tryptophan, phenylacetylglutamine, indoleamine-2,3-dioxygenase, UPLC-MS


1. INTRODUCTION

Gastric cancer (GC) is the fifth most common cancer worldwide1 and its prognosis is related to the stage of disease at diagnosis2. Gastric carcinogenesis is a multistep and multifactorial process beginning with active chronic gastritis induced by Helicobacter pylori (H. pylori) infection3,4. The progression is often described via a sequence of events known as Correa’s cascade5,6, a stepwise progression from non-active gastritis through chronic active gastritis and precursor lesions of gastric cancer (atrophy, intestinal metaplasia, dysplasia) to gastric adenocarcinoma. Early stages of GC are often asymptomatic; when GC is diagnosed and resected at an early stage, the 5-year survival rate exceeds 90%. For patients diagnosed at advanced stages, however, the 5-year survival rate drops to 20%. GC diagnosis is based on endoscopy, biopsy and pathological examination. Because these procedures are invasive, the development of non-invasive biomarkers of GC has become an active field of research that could improve the detection rate of early GC and therefore the prognosis of this prevalent neoplasia7.

Metabolomics aims at the comprehensive quantitative analysis of all metabolites in a cellular system in a given state at a given time8. Metabolites reflect the interaction of the genome, transcriptome and proteome with the environment. Therefore, the identification of perturbed metabolic pathways may lead to the recognition of phenotypes associated with specific physiological conditions, enabling the development of diagnostic classifiers and providing insight into biological processes. In this context, metabolomics has emerged as a promising technology that is increasingly being used in biomedical research7,9, including oncology and gastric cancer7,10,11.

Using gas chromatography time-of-flight mass spectrometry, metabolomic analysis of plasma showed a significant difference between chronic active gastritis and GC metabolic profiles, while metabolic phenotypes of intestinal metaplasia were similar to GC12. More recently, a multicenter study13 analyzed the plasma free amino acid (PFAA) profiles of patients diagnosed with lung, gastric, colorectal, breast or prostate cancer. Results showed significant alterations in the PFAA profiles of cancer patients, involving a limited set of amino acids reflecting metabolic changes common to different types of cancer and a larger group of amino acids specific to each cancer. An additional study analyzed the differences between 17 PFAAs from cancer patients and age-matched healthy controls. In addition, levels of 4 PFAAs showed dynamic alterations during the perioperative period (5-15 days after tumor resection) in GC patients14. Metabolomic analysis of GC has also been carried out in urine samples. Urine samples from GC patients and healthy individuals and 30 pairs of matched tumor and normal stomach tissues were analyzed using 1H nuclear magnetic resonance and 1H high-resolution magic angle spinning spectroscopy, respectively. Amino acid and lipid alterations in urine from GC patients were consistent with changes observed in GC tissue15.


The objective of the current study was to identify a plasma metabolic pattern characteristic of GC through disease progression within the Correa’s cascade. Accordingly, the study involved the analysis by ultra-performance liquid chromatography – time of flight mass spectrometry (UPLC-TOFMS) of plasma samples collected from 143 patients with non-active gastritis and no H. pylori infection (NAG-), patients with chronic active gastritis and H. pylori infection (CAG+), precursor lesions of gastric cancer with and without H. pylori infection (PLGC) or gastric cancer. This multi-class problem was analyzed through six independent binary problems implemented in a decision directed acyclic graph. Results obtained allowed the identification of tryptophan and kynurenine as discriminant metabolites that could be attributed to indoleamine-2,3-dioxygenase (IDO) up-regulation in cancer patients leading to tryptophan depletion and kynurenine metabolites generation. Furthermore, phenylacetylglutamine was also classified as a discriminant metabolite. Our data suggest the use of tryptophan, kynurenine and phenylacetylglutamine as potential GC biomarkers.

2. EXPERIMENTAL SECTION

Patients

The study was approved by the Ethics Committee of the Corporació Sanitària Parc Taulí (Institut Universitari Parc Taulí, Sabadell, Spain). Outpatients referred to the Endoscopy Unit for evaluation of dyspepsia and patients with GC undergoing preoperative endoscopic ultrasound (EUS) were asked to participate. The inclusion period was from February 2009 to February 2015. Collection of plasma samples from NAG, CAG, PLGC and GC patients started simultaneously and no difference was found in the distributions. A single sample collection protocol was used for all patients included in the study.

Dyspeptic patients16 were contacted three weeks prior to the endoscopy. Those who agreed to participate were instructed to avoid antisecretory drugs during the two weeks before the test. Exclusion criteria were: inability to stop antisecretory drugs, antibiotic treatment within four weeks before the endoscopy, and previous H. pylori treatment. Before the endoscopy, a 13C-urea breath test (UBT) (UBiTest 100 mg, Otsuka Pharmaceutical Europe Ltd, UK) was administered. During endoscopy, antral biopsies for histology and for rapid urease test (RUT, JATROX HP test CHR Heim Arzneimittel GmbH, Germany) were obtained. Histological examination was performed by a pathologist specialized in digestive diseases. Each specimen was studied for the presence of H. pylori, chronic active gastritis, atrophy, intestinal metaplasia and lymphoid follicles. Patients with atrophy and/or intestinal metaplasia were classified as PLGC. Patients with concordant positive RUT, UBT and histopathology (Giemsa staining) results were considered H. pylori positive. Patients with all tests negative were considered uninfected, in accordance with the recommendations of the European Hp Study Group17. Plasma samples from GC patients were collected before the EUS procedure for preoperative staging. After surgical resection, GC was staged according to the TNM staging system.

The study enrolled 145 patients. Two GC samples were excluded from the data analysis because the tumor stage classification was not available, leaving 143 plasma samples (65 males and 78 females, age range 19–74 years). Forty patients were H. pylori–negative with non-active gastritis (NAG-), 38 were H. pylori–positive with chronic active gastritis (CAG+) and 32 had precursor lesions of gastric cancer (atrophy and/or intestinal metaplasia, PLGC). Five patients had peptic ulcer (Table 1). The gastric cancer group was composed of 33 patients (19 males and 14 females, age range 39-74 years). Fifteen tumors were located in the antrum, 9 in the corpus and 7 in the cardia. Eleven adenocarcinomas were intestinal type, 13 were diffuse and 2 mixed. Nineteen patients were diagnosed at early stages (TNM I+II) (Table 1).

Sample collection and preparation

Five ml whole blood samples were collected into Vacutainer EDTA-K3 tubes (BD Biosciences, Spain). Plasma was prepared by centrifugation of the samples at 2400 x g for 10 min at 4°C and stored at -80°C until analysis. Prior to analysis, EDTA-plasma samples were thawed on ice. Then, 60 µl of plasma were withdrawn and 180 µl of cold methanol were added for protein precipitation. Samples were homogenized (vortex, 10 s) and centrifuged at 13000 x g (4°C, 15 min). Two hundred µl of the supernatant were collected, transferred to a 96-well plate and evaporated to dryness under vacuum at 25°C (SpeedVac Concentrator SPD121P, Thermo, Massachusetts, USA). The residue was redissolved in 50 µl of a solution containing phenylalanine-D5 (Cambridge Isotopes Laboratory Inc., Andover, MA, USA), caffeine-D9 (Toronto Research Chemicals, Toronto, Ontario, Canada), leucine enkephalin and reserpine (Sigma-Aldrich Química SA, Madrid, Spain) in H2O:CH3OH (1:1, 0.1% v/v HCOOH), each at a concentration of 4 µM.
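As a rough check of the dilution implied by this protocol, the short sketch below estimates the plasma-equivalent volume carried into the reconstituted extract. It assumes that volumes are additive on mixing and that the protein pellet occupies a negligible volume; neither assumption is stated in the text, and the function name is purely illustrative.

```python
def plasma_equivalent_ul(plasma_ul=60.0, methanol_ul=180.0,
                         supernatant_ul=200.0):
    """Plasma volume (µl) represented by the collected supernatant.

    Assumes additive volumes and a negligible pellet volume.
    """
    total_ul = plasma_ul + methanol_ul            # 240 µl after precipitation
    fraction_collected = supernatant_ul / total_ul
    return plasma_ul * fraction_collected


reconstitution_ul = 50.0                          # volume used to redissolve
equivalent = plasma_equivalent_ul()
# A factor above 1 would mean the extract is more concentrated than plasma.
concentration_factor = equivalent / reconstitution_ul
print(round(equivalent, 2), round(concentration_factor, 2))
```

Under these assumptions the reconstituted extract corresponds to about 50 µl of plasma in 50 µl of solvent, that is, roughly the original plasma concentration.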
Blanks were prepared by replacing plasma samples with H2O:CH3OH (1:1, 0.1% v/v HCOOH). A quality control (QC) sample was prepared by mixing 5 µl of each sample.

UPLC-TOFMS sample analysis and peak table generation

The sample set included 35 QC replicates dispersed evenly throughout the batch, 3 QCs at the beginning and end of the batch, 2 blanks and 145 plasma samples. Sample acquisition was randomized and the QC was analyzed every 4 injections to monitor and correct changes in the instrument response. Chromatographic analysis was performed on an Agilent 1290 Infinity UPLC chromatograph using a UPLC BEH C18 (100 x 2.1 mm, 1.7 µm, Waters, Wexford, Ireland) column. Autosampler and column temperatures were set to 4°C and 40°C, respectively, and the injection volume was 4 µl. Gradient elution was performed at a flow rate of 400 µl min-1 as follows: initial conditions of 98% of mobile phase A (H2O (0.1% v/v HCOOH)) were kept for 0.5 min, followed by a linear gradient from 2% to 20% of mobile phase B (CH3CN (0.1% v/v HCOOH)) in 3.5 min and from 20% to 95% B in 4 min. 95% B was held for 1 min and then a 0.25 min gradient was used to return to the initial conditions, which were held for 2.75 min. All solvents were of LC-MS grade and were purchased from Scharlau (Barcelona, Spain). Ultra-pure water was generated with a Milli-Q water purification system (Merck Millipore, Darmstadt, Germany). Formic acid (≥95%) was obtained from Sigma-Aldrich Química SA.

Full scan MS data from 100 to 1700 m/z with a scan frequency of 6 Hz (1274 transients/spectrum) were collected on an iFunnel quadrupole time of flight (QTOF) Agilent 6550 spectrometer (Agilent Technologies, CA, USA) in the TOF MS mode. The following electrospray ionization parameters were selected: gas T, 200°C; drying gas, 14 l/min; nebulizer, 37 psig; sheath gas T, 350°C; sheath gas flow, 11 l/min. Automated UPLC-ESI(+)-QqTOF (MS/MS) analysis of the QC and a reduced set of samples was carried out to support compound identification using two collision energies (20, 40 V). ESI(-) was not employed in this study in order to increase the data acquisition frequency; future work will use fast polarity switching during data acquisition for more comprehensive coverage. Automatic recalibration of MS spectra during analysis was carried out by introducing a mass reference standard into the source via a reference sprayer valve, using the 149.02332 (background contaminant), 121.050873 (purine) and 922.009798 (HP-0921) m/z as references. Data acquisition and manual integration of internal standards were carried out using MassHunter workstation (Agilent).

Centroid raw UPLC-TOF-MS data were converted into mzXML format using ProteoWizard (http://proteowizard.sourceforge.net/) before generating peak tables using XCMS software (http://metlin.scripps.edu/xcms/). The centWave method was used for peak detection with the following parameters: m/z range = 100-850, ppm=20, peakwidth=(3, 20), snthresh=25.
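For reference, the gradient program described at the start of this subsection can be written as a piecewise-linear function of time. The sketch below is only an illustration: the breakpoints are transcribed from the text, while the function and variable names are hypothetical.

```python
from bisect import bisect_right

# (time in min, % mobile phase B) breakpoints transcribed from the text.
GRADIENT = [
    (0.0, 2.0),    # initial conditions: 98% A
    (0.5, 2.0),    # held for 0.5 min
    (4.0, 20.0),   # linear 2% -> 20% B in 3.5 min
    (8.0, 95.0),   # linear 20% -> 95% B in 4 min
    (9.0, 95.0),   # held for 1 min
    (9.25, 2.0),   # 0.25 min return to initial conditions
    (12.0, 2.0),   # re-equilibration for 2.75 min
]


def percent_b(t_min):
    """Linearly interpolate %B at time t_min (minutes)."""
    times = [t for t, _ in GRADIENT]
    if not times[0] <= t_min <= times[-1]:
        raise ValueError("time outside the gradient program")
    i = bisect_right(times, t_min) - 1
    if i == len(GRADIENT) - 1:       # exactly at the end of the run
        return GRADIENT[-1][1]
    (t0, b0), (t1, b1) = GRADIENT[i], GRADIENT[i + 1]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


total_run_min = GRADIENT[-1][0]
print(total_run_min, percent_b(2.25), percent_b(6.0))
```

Read this way, each injection takes 12 min including re-equilibration.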
A minimum difference in m/z of 7.5 mDa was selected for overlapping peaks. Intensity weighted m/z values of each feature were calculated using the wMean function. Peak limits used for integration were found through descent on the Mexican hat filtered data. Peak grouping was carried out using the ‘nearest’ method using mzVsRT=1 and RT and m/z tolerances of 6 s and 20 mDa, respectively. After peak grouping, the fillPeaks method with the default parameters was applied to fill missing peak data. A total of 10246 features were initially detected after peak detection, integration, chromatographic deconvolution and alignment. Peak integration accuracy was assessed by comparing automated and manual integration results for internal standards (Figure S-1). Internal standards were used for checking the integrity of automated integration and for an initial analysis of instrument performance throughout the batch. The obtained peak table was imported into MATLAB R2014b (Mathworks Inc., Natick, MA, USA) for data analysis. Blank samples were used to identify and remove features arising from e.g. source contaminants, components from tubes or solvent impurities, leaving a data matrix X (188 x 2076), in which each row represents a


chromatogram and each column a metabolic feature (i.e. m/z - retention time). Systematic identification of selected discriminant metabolites was carried out by matching m/z values against the Human Metabolome Database (HMDB, www.hmdb.ca) and METLIN (metlin.scripps.edu) databases with 5 ppm accuracy. To increase confidence in the identification and reduce the number of false hits, MS/MS fragmentation spectra were matched against reference spectra in the HMDB, METLIN and mzcloud (www.mzcloud.org) databases. Tryptophan (Sigma-Aldrich Química SL, Spain), kynurenine (Sigma-Aldrich Química SL) and phenylacetylglutamine (LGC Ltd, UK) retention times and MS/MS spectra were acquired from the analysis of individual standards in H2O:CH3CN (98:2, 0.1% v/v HCOOH), prepared by direct dilution from individual DMSO stock solutions and analyzed employing the same UPLC-MS/MS conditions.

Intra-batch effect correction

Intra-batch effect correction was carried out using an approach based on quality control (QC) samples and Support Vector Regression (QC-SVRC)18. Using a Radial Basis Function kernel, the performance of the QC-SVRC is determined by an optimal selection of the ε-insensitive loss parameter, the error penalty C and the kernel parameter γ. The ε-insensitive loss parameter for each variable was selected as 7.5% of the median output value in QCs. The error penalty C was defined for each variable as the difference between the 10th and 90th percentiles in QCs, and the kernel parameter γ providing the lowest RMSE estimated by leave-one-out CV (RMSECV) was selected for each variable in the [2^-3, 2^-2, …, 2^6] range. Support Vector Regression was carried out in MATLAB using the LIBSVM library19. Data and MATLAB scripts used in this work are available from the authors.

Decision Directed Acyclic Graph (DDAG)-PLSDA models

The aim of this study was to identify a metabolic pattern in plasma characteristic of GC through disease progression within the Correa’s cascade. Accordingly, it involved the analysis of samples from patients diagnosed with non-active gastritis and no H. pylori infection (NAG-, n=40), chronic active gastritis and H. pylori infection (CAG+, n=40), precursor lesions of gastric cancer with or without H. pylori infection (PLGC+, n=20 or PLGC-, n=12, respectively) or gastric cancer (GC, n=33). PLGC+ and PLGC- were merged into a single group (i.e. PLGC) regardless of H. pylori infection to simplify the analysis and keep a balanced number of samples across classes. This multi-class problem was analyzed through six independent binary problems implemented in a Decision Directed Acyclic Graph (DDAG)20 (see Figure 1). The DDAG assumes that each sample belongs to one of the classes and combines the results of the binary models to predict the class of new samples. To evaluate the DDAG for a new sample, the first node (i.e. model 1: NAG- vs GC) is evaluated. The node is exited via the left or right edge according to the outcome of the first discriminant model and then the next node is evaluated. The path taken through the DDAG is the evaluation path where each node


eliminates a class from the list of potential outcomes20. The predicted class of the sample is the value of the final node located in the last layer of the DDAG. PLSDA figures of merit used to evaluate the between-classes discrimination at each DDAG node were estimated by 7-fold double cross validation (2CV) to avoid overoptimistic results21. The random 7-fold split of the binary data sets into calibration and validation subsets during 2CV was repeated 5 times. The selection of the number of latent variables (LVs) in the inner loop of the 2CV was based on RMSECV values. PLS regression vectors were averaged and the mean vector was subsequently used for the identification of discriminant variables22. Predicted classes were estimated, for each sample, using the median predicted y values obtained after 5 random 2CV iterations. Sensitivity, specificity, positive and negative predictive values (PPV and NPV, respectively), the proportion of misclassified samples (NMC = false positive (FP) + false negatives (FN)) and the area under the receiver operator characteristic (AUROC) curve were selected as 2CV figures of merit. Sensitivity was estimated as the proportion of individuals with the target condition (i.e. GC in models 1, 2 and 4; PLGC in model 3 and 5; CAG+ in model 6) in whom the test is positive. Specificity was estimated as the proportion of individuals without the target condition in whom the test is negative. PPV was calculated as the proportion of test positives who have the target condition and NPV as the proportion of test negatives who do not have the target condition23. The ratio of NMC and the AUROC values range between 0 and 1. Random guessing leads to a diagonal line ROC curve and an area of 0.5. Likewise, if a binary classifier has no predictive performance, the average ratio of accurately classified samples in a balanced model is 50%. However, there is no threshold AUROC or NMC value corresponding to a good or null discrimination between groups. 
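The DDAG evaluation path described at the start of this paragraph can be sketched in a few lines. This is a minimal illustration rather than the authors' MATLAB implementation: `binary_winner` is a hypothetical stand-in for a trained PLSDA model at a node, `toy_winner` is an invented toy rule, and the class ordering is an assumption chosen so that the first node compares NAG- with GC, as stated in the text.

```python
CLASSES = ["NAG-", "CAG+", "PLGC", "GC"]  # assumed ordering along the cascade


def ddag_predict(sample, binary_winner, classes=CLASSES):
    """Walk the DDAG: each node compares the first and last remaining
    classes and eliminates the loser until a single class is left."""
    remaining = list(classes)
    path = []
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = binary_winner(sample, first, last)
        if winner not in (first, last):
            raise ValueError("binary model must return one of its two classes")
        path.append((first, last, winner))
        # Drop the class rejected at this node.
        remaining.remove(last if winner == first else first)
    return remaining[0], path


# Toy stand-in for the PLSDA nodes (hypothetical): pick the class whose
# position in CLASSES is closest to a one-dimensional "score".
def toy_winner(score, class_a, class_b):
    return min((class_a, class_b), key=lambda c: abs(CLASSES.index(c) - score))


predicted, path = ddag_predict(2.2, toy_winner)
print(predicted, len(path))
```

With four classes, every sample visits three of the six binary nodes, so a single wrong decision at any visited node removes the true class from the list of candidates.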
To overcome this drawback, the statistical significance of the figures of merit was assessed by a permutation test, in which null distributions were estimated by 1000 random permutations of the class labels24 and then, empirical p-values were computed as the fraction of permuted statistics that were at least as extreme as the test statistic obtained using real class labels. Besides, sensitivity, specificity, PPV and NPV were estimated along with exact binomial confidence limits23. During the assessment of the statistical significance of the PLSDA models at each DDAG node, a total of 1000 PLSDA models were calculated using random class labels. Using these results, a variable was classified as discriminant if its value in the mean PLS regression vector obtained using real class labels did not belong (p-value < 0.035) to the distribution obtained using randomly permuted class labels25,22. The predictive performance of the DDAG was evaluated by 2CV using both, the initial set of variables and the selected subsets of variables identified as discriminant for each PLSDA model. For each class, samples were randomly split into 5 cross-validation subsets. Then, one subset of each class was left out and the set of PLSDA models included in the DDAG was


developed using the remaining samples. The selection of the number of LVs of each PLSDA model was based on the RMSECV. Then, the validation samples were classified using the DDAG and the process was repeated until all samples were included once in the validation subset. The random 5-fold split into calibration and validation subsets was repeated 50 times and results were averaged. PLSDA was carried out using PLS Toolbox 8.0 (Eigenvector Research Inc., Wenatchee, USA) and in-house written MATLAB scripts.

3. RESULTS AND DISCUSSION

UPLC-TOFMS data quality assessment

The repeated analysis of a QC sample dispersed evenly throughout the batch is a widely used strategy to obtain an accurate estimation of the instrumental variation26. Variables showing a relative standard deviation (%RSD) in QCs above a certain threshold (e.g. 20%) can be classified as unreliable and removed from further analysis27,28, reducing the likelihood of observing chance correlations and increasing the precision of the multivariate discriminant models29,30. While this data-cleaning step leads to the elimination of noisy spurious variables, it can also lead to the elimination of metabolic features that carry biological information but are affected by intra-batch effects (i.e. gradual changes in the LC-MS performance over long batches). Intra-batch effects constrain repeatability and reproducibility, reduce the power to differentiate metabolic responses from uninformative signals and hinder the interpretation of the biological information provided31. To circumvent these potential pitfalls, an initial correction of intra-batch instrumental variation was carried out. Different strategies have been developed to fit non-linear functions to periodically injected QCs, using robust splines27,32,33 or support vector regression18, followed by a normalization of the complete data set to this curve. In this work, a Radial Basis Function kernel for a Support Vector Regression based correction (QC-SVRC)18 was employed.
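The %RSD-based filtering step mentioned above can be illustrated as follows. This is a minimal sketch using only the Python standard library; the feature names and intensities are invented for illustration, and the 20% threshold is the example value quoted in the text.

```python
from statistics import mean, stdev


def percent_rsd(values):
    """Relative standard deviation (%) = 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def reliable_features(qc_table, threshold=20.0):
    """Keep features whose %RSD across QC injections is below threshold.

    qc_table maps a feature label to its intensities in the QC replicates.
    """
    return [name for name, values in qc_table.items()
            if percent_rsd(values) < threshold]


# Hypothetical QC intensities for two features (not data from the study).
qc_table = {
    "feature_A": [100.0, 104.0, 98.0, 101.0, 97.0],   # stable signal
    "feature_B": [100.0, 160.0, 60.0, 140.0, 40.0],   # noisy signal
}
kept = reliable_features(qc_table)
print(kept)
```

In this toy example only the stable feature survives the filter; a feature that drifts systematically over the batch would also be rejected unless the drift is corrected first.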
While QC-SVRC reduces intra-batch effects, it might lead to overly optimistic %RSD values and to ineffective variable elimination if the same QCs are used for both the fitting of the non-linear functions and the identification of unreliable variables after correction. To avoid this trap, one of every two consecutive QCs was used for QC-SVRC and the remaining QCs were used for calculation of the %RSD of the metabolic features after correction. As an example, Figure S-2 shows peak area values of tryptophan (Trp) before and after QC-SVRC. Figure S-3(A) shows the cumulative distributions of %RSD in QCs for the 2076 metabolic features before and after QC-SVRC. This step increased the number of variables showing %RSD below a threshold value of 20% from 861 to 1212 and decreased the median value from 19% to 14.6%. Finally, in a second data cleaning step, variables were removed if the difference between the median value in QC and plasma samples was relatively large, evaluated using the Wilcoxon Signed-Rank test33 (critical p-value = 10^-16), leaving a subset of 1152 variables that were retained for further analysis, distributed as shown in Figure S-3(B).

The PLS predicted y values for QC replicates should ideally be constant, irrespective of their value, so their variation provides a straightforward way to test the effect of instrumental variation on the PLSDA outcome. Table S-1 summarizes the standard deviation of the predicted y values for the set of QCs left aside during QC-SVRC. The comparison of the standard deviations before and after intra-batch effect elimination (i.e. columns X and XQC-SVRC in Table S-1) showed that this step alone increased the precision of the PLSDA models built in this study. The elimination of noisy variables before or after QC-SVRC reduced, as expected, the precision bounds for all PLSDA models (i.e. columns Xclean and XQC-SVRC-clean in Table S-1). Nonetheless, the use of QC-SVRC allowed the use of a higher number of metabolic features while providing the highest precision levels.

DDAG-PLSDA models for the multivariate analysis of GC progression

Results from 2CV of the six PLSDA models included in the DDAG, summarized in Table 2, showed metabolomic phenotypes that differed at p-values […] 0.05), nor the previously selected subset of 151 GC discriminant variables (NMC=53%, AUROC=0.33, p-values > 0.05).

Metabolite identification

Metabolite annotation based on both MS and MS/MS data led to the identification of tryptophan (Trp), phenylacetylglutamine (PAGN) and kynurenine (Kyn) among the variables commonly and exclusively selected as discriminant in models 1, 2 and 4 (see Figure 2). The analysis of individual Trp, PAGN and Kyn standard solutions under the same UPLC-MS/MS conditions allowed the identification of 19 and 4 variables as in-source fragment ions or adducts of Trp and PAGN, respectively (see Table 4 and Tables S-2, S-3 and S-4).
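The m/z matching used for annotation (5 ppm tolerance, see the Experimental Section) amounts to a relative mass-error check. The sketch below is illustrative only: the [M+H]+ values are approximate theoretical monoisotopic m/z values computed for this example rather than figures quoted in the paper, and the function names are hypothetical.

```python
# Approximate theoretical [M+H]+ m/z values (assumed, for illustration).
REFERENCE_MZ = {
    "tryptophan": 205.0972,
    "kynurenine": 209.0921,
    "phenylacetylglutamine": 265.1183,
}


def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return 1e6 * (measured_mz - theoretical_mz) / theoretical_mz


def match_mz(measured_mz, reference=REFERENCE_MZ, tol_ppm=5.0):
    """Return candidate names whose m/z lies within tol_ppm of the input."""
    return [name for name, mz in reference.items()
            if abs(ppm_error(measured_mz, mz)) <= tol_ppm]


hits = match_mz(205.0978)      # hypothetical measured value, about 3 ppm off
misses = match_mz(205.1000)    # hypothetical measured value, about 14 ppm off
print(hits, misses)
```

At m/z 205, a 5 ppm window corresponds to roughly 1 mDa, which is why an accurate-mass match alone still needs MS/MS spectra and retention times of standards to confirm an identity.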
Boxplots of the relative levels of these metabolites in the four groups of patients showed decreased relative levels of Trp in the GC group compared to the NAG-, CAG+ and PLGC groups (see Figure 3). This difference was also found to be statistically significant between NAG- and CAG+ and PLGC groups. The relative concentration of kynurenine in GC samples was slightly higher than in NAG-, CAG+ and PLGC samples, but the differences were not statistically significant. The decreased relative levels of Trp in GC samples shown in Figure 3 are in agreement with the upregulated expression in cancer patients of the Trp-metabolizing enzymes IDO and IDO2 and the liver enzyme tryptophan dioxygenase34,35. Immune dysregulation is a key event for tumor evasion of the host immune system. IDO and IDO2 control the Trp catabolism signaling pathway, generating kynurenine and other downstream catabolites that can modulate T-cell immunity, as well as a local microenvironment that is starved of Trp. One of the major pathways by which IDO can affect the T-cell immune response is via activation of suppressive regulatory T (Treg) cells34,35,36,37.

Results found for PAGN showed statistically significant increased relative levels in the GC group, but no change among the NAG-, CAG+ and PLGC groups. PAGN is known to form in the liver from glutamine and phenylacetyl-CoA. This result might indicate a deregulation of phenylalanine or glutamine metabolism. PAGN is also a known microbial metabolite38,39, so the observed changes in its plasma levels could be attributed to microbial metabolism, host metabolism or their interaction. To the best of our knowledge, the link of this result to GC is not known and further information is needed to assess the link with GC progression.

H. pylori and age as confounding factors

Previous results suggested that H. pylori infection could induce metabolic changes mediated by cytokines produced in the gastric mucosa. The epidemiologic link between infection, growth disturbances and altered lipid and glucose metabolism is disputable40. On the other hand, H. pylori is generally not present over areas of intestinal metaplasia in the gastric mucosa. Patients with extensive atrophy and/or intestinal metaplasia generally have a low-density infection or have even lost H. pylori41. Therefore, if H. pylori infection is differently distributed between classes in the DDAG nodes and it modifies the metabolic profiles, then it would be a potential confounder, as the models might actually be discriminating H. pylori infection. To test whether the observed differences were affected by H. pylori infection, metabolic profiles of the PLGC+ and PLGC- groups were analyzed by PLSDA, again using 5 random 7-fold 2CV iterations and a permutation test for the assessment of the figures of merit. No statistically significant difference was found (NMC=75% and AUROC=0.52, p-values > 0.05). This result indicated that, for this particular data set, either the subtle changes due to H. pylori infection did not have a significant effect on the metabolic profile or the analytical strategy was not powerful enough to detect them. In any case, it should be confirmed by further analysis in larger cohorts and also using complementary metabolomic approaches (e.g. HILIC-MS, gas chromatography-MS).
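The permutation test used to judge figures of merit such as NMC can be sketched as below. This is a simplified illustration under stated assumptions: a fixed threshold rule on a single hypothetical variable stands in for the cross-validated PLSDA model, the data are invented, and the empirical p-value is taken as the fraction of label permutations whose misclassification rate is at least as low as the observed one.

```python
import random


def misclassification_rate(values, labels, threshold):
    """Fraction of samples misclassified by the rule: value > threshold -> 1."""
    wrong = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
    return wrong / len(labels)


def permutation_p_value(values, labels, threshold, n_perm=1000, seed=0):
    """Empirical p-value: share of label permutations with an error rate
    at least as low as the one observed with the real labels."""
    rng = random.Random(seed)
    observed = misclassification_rate(values, labels, threshold)
    shuffled = list(labels)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                 # random reassignment of labels
        if misclassification_rate(values, shuffled, threshold) <= observed:
            as_extreme += 1
    return observed, as_extreme / n_perm


# Invented single-variable data for two groups (not data from the study).
values = [0.9, 1.1, 1.0, 1.2, 0.8, 1.9, 2.1, 2.0, 2.2, 1.8]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
observed, p_value = permutation_p_value(values, labels, threshold=1.5)
print(observed, p_value)
```

In a full workflow the complete model, including cross-validation, would be rebuilt for every permutation of the labels; the fixed rule here only keeps the example short.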
The analysis was then carried out using the selected subset of 151 variables, again obtaining a non-statistically significant difference between the PLGC+ and PLGC- groups of samples (NMC=53% and AUROC=0.33, p-values > 0.05). Older people are more likely to be diagnosed with GC and disease progression is confounded by age (see Table 1). In addition, other factors associated with age, such as lifestyle or body mass index, may also be relevant. Age-adjusted multiple linear regression showed a statistically significant (p-value