
Article

Assessment of the quality of doping substances identification in urine by GC-MS/MS José Narciso, Susana Luz, and Ricardo J. N. Bettencourt da Silva Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.9b00560 • Publication Date (Web): 24 Apr 2019 Downloaded from http://pubs.acs.org on April 27, 2019



Analytical Chemistry

Assessment of the quality of doping substances identification in urine by GC-MS/MS

José Narciso,†,‡ Susana Luz,‡ Ricardo Bettencourt da Silva*,†

†Centro de Química Estrutural - Faculdade de Ciências da Universidade de Lisboa, Campo Grande, 1749-016 Lisboa, Portugal; email: [email protected]; ‡Laboratório de Análises de Dopagem, Av. Prof. Egas Moniz, 1600-190 Lisboa, Portugal

ABSTRACT: The direct control of doping in sport is based on the analysis of active substances and/or their metabolites in urine samples of the athletes by GC-MSⁿ or LC-MSⁿ. The World Anti-Doping Agency, WADA, defined criteria for the agreement between retention times, RT, or relative retention times, RRT, and abundance ratios, AR, of characteristic ions of the mass spectrum of the analyte in a calibrator (Positive Control) and the sample. Strict criteria for confirming analyte presence were defined to reduce the false positive results rate, FP. However, these criteria can lead to high rates of false negative results, FN. This work presents a methodology to define statistically sound criteria for the agreement between RRT and AR that allow keeping the FN under control. The FP of identifications is also determined. The statistical criteria were set from Monte Carlo Simulations of correlated RT and ion abundances. The simulation of AR and signal noise was also used to estimate the FN and FP of identifications based on the criteria defined by WADA. The developed tools were successfully applied to the control of nine doping substances in urine samples by GC-MS/MS. The estimated FN were checked against independent experimental tests, which showed that the estimates are accurate. The criteria defined by WADA are associated with extremely low FP but, in some cases, with FN much larger than 50 %. The statistically sound identification criteria allow a more convenient balance between FN and FP. The user-friendly spreadsheet used in this work is made available as Supporting Information.

The direct control of the use of doping substances by athletes is performed through the analysis of these substances and/or their metabolites in urine samples.
The false assessment of the use of these substances has an impact both on the sport competition and on the individual interests of the tested athlete¹. The non-detection of the use of forbidden substances (i.e. false negative results) benefits the transgressors and encourages the use of those substances, with a negative impact on the health of athletes². On the other hand, the false identification of the presence of substances in the urine of an athlete concentrates most of the negative impact on the individual². Therefore, although false negative results are undesirable, false positive identifications of doping substances in athletes’ urine have a more acute impact, though this is concentrated in few people³. For this reason, the strategies used in doping control for the identification of forbidden substances focus on reducing the probability of false positive results. However, as the probability of false positive results is reduced, the false negative results rate (FN) increases.

The World Anti-Doping Agency (WADA) sets rules for the official monitoring of doping in sport competitions. These rules are set considering the need for harmonizing control procedures and for adequately managing the risk of false assessments. The WADA Technical Document TD2015IDCR⁴ sets criteria for the identification of doping substances in urine samples by LC-MSⁿ or GC-MSⁿ, specifying limit values for the agreement between the retention times, RT, or relative retention times, RRT, and abundance ratios, AR, of characteristic ions of the mass spectrum of the analyte observed in calibrators (Positive Controls) and samples. The defined criteria are applicable to different analytes, urine matrices and instrumental conditions. WADA defined strict criteria for confirming the presence of compounds, which reduce the chance of reporting false positive results. This invariably involves a larger chance of false negative results, which can be important on some occasions.


WADA allows the analysis of samples collected ten years before using state-of-the-art technology and procedures, which allows detecting substances undetected in the past⁵. The development of more sensitive and selective instrumentation has contributed, and will surely continue to contribute, to improving doping analysis.

Da Silva⁶ developed a methodology to define statistically sound criteria for the RT, RRT and AR of characteristic ions of the mass spectrum of the analyte observed in unknown samples, applicable to identifications performed by GC-MSⁿ or LC-MSⁿ. This methodology is based on the experimental determination of the dispersion and correlation of RT and ion abundance, A, from replicate injections of solutions with the analyte, and on the modelling or simulation of the dispersion of the RT or RRT, and AR, used in analyte identification.

This work developed tools to define statistically sound criteria for the agreement between the RRT and AR of the analyte observed in calibrators and samples injected under repeatability conditions (i.e. in the same day). In those cases, instead of modelling the variability of RT, RRT or AR, the difference between these parameters observed in the calibrator and the sample is modelled. These models, and models of the signal from ‘blank’ samples (i.e. samples with undetected analyte), were used to estimate the true positive (TP) and false positive (FP) results rates, subsequently used to estimate the uncertainty of identifications as likelihood ratios. The developed tools were successfully applied to the identification of some anabolic steroids, diuretics or masking agents, stimulants and a cannabinoid in urine samples by GC-MS/MS. The performance of identifications based on statistically sound criteria for RRT and AR was compared with the performance of identifications based on the criteria defined by WADA. The Supporting Information includes a list of acronyms and symbols used in the text.
Acronyms are presented in roman while symbols are in italic with a minimum number of letters and subscripts.

THEORY

Confidence intervals of identification parameters

Although the RT and A of a compound estimated from the replicate injections of a solution have an approximately normal distribution, the combination of pairs of RT of different compounds, or of A of the same compound, in a RRT or AR produces a parameter (i.e. RRT or AR) with a distribution that can deviate significantly from normality. The deviation from normality depends on the value, dispersion and correlation between the combined RT or A.

When the identification of compounds is based on the comparison of the RT or RRT, and AR, of the analyte observed in a calibrator and the sample under repeatability conditions, the difference between RT, d(RT), or RRT, d(RRT), and AR, d(AR), is determined and it is checked if these differences are within the acceptance intervals for these parameters. The confidence intervals for d(RT) or d(RRT), and d(AR), can be estimated from the distribution of RT, RRT or AR, respectively. The d(RRT) and d(AR) tend to be more normally distributed than RRT and AR. The distribution of the RT, RRT, AR, d(RT), d(RRT) or d(AR) can be described analytically or by simulations from the Monte Carlo Method, MCM.

Da Silva⁶ proposed the Monte Carlo Simulation of AR by using a MS-Excel file that takes the mean, standard deviation and Spearman’s correlation coefficient between both A. Equations (1) and (2) present the MS-Excel formulas used for the simulation of the abundances of two ions 1 and 2, $A_1(i)$ and $A_2(i)$ (simulation index i), associated with mean abundances $\bar{A}_1$ and $\bar{A}_2$, standard deviations $s_{A_1}$ and $s_{A_2}$, and the Spearman’s correlation coefficient between both abundances, $\rho_{1;2}$:

$$A_1(i) = \bar{A}_1 + s_{A_1}\,\mathrm{TINV}(R_1, \nu_{A_1}) \tag{1}$$

$$A_2(i) = \bar{A}_2 + s_{A_2}\left(\mathrm{TINV}(R_1, \nu_{A_2})\,\rho_{1;2} + \mathrm{TINV}(R_2, \nu_{A_2})\,(1 - \rho_{1;2}^{2})^{0.5}\right) \tag{2}$$

where $\nu_{A_1}$ and $\nu_{A_2}$ are the numbers of degrees of freedom associated with $s_{A_1}$ and $s_{A_2}$, respectively, and $R_1$ and $R_2$ are two independent generators of random values from U(0,1) (uniform distribution between 0 and 1; MS-Excel formula: RAND()). Equivalent Monte Carlo Simulations can be performed for RRT and, by simulating independent RRT or AR values for the calibrator and the sample, it is possible to model d(RRT) and d(AR). There is no need to use the MCM for modelling d(RT) since RT has a normal distribution and d(RT) is also normal, with a mean equivalent to zero and a standard deviation √2 times larger than the standard deviation of RT⁷,⁸.

The Supporting Information to this paper includes a spreadsheet in which pairs of RT of the analyte and the internal standard, or pairs of A of ions of two fragments of the mass spectrum of the analyte, can be entered, and which models the dispersion of RRT and d(RRT), or of AR and d(AR). The file reports the 0.5th, 2.5th, 50th, 97.5th and 99.5th percentiles (P0.5, P2.5, P50, P97.5 and P99.5, respectively) of the simulated parameters, where the pairs P0.5 and P99.5 or P2.5 and P97.5 limit the confidence intervals for 99 % or 95 % confidence levels, respectively. The confidence level of the confidence intervals corresponds to the TP of identifications based on the parameter.

False positive results rate

Da Silva⁶ proposed estimating the FP associated with identifications based on AR by modelling the signal noise of the mass spectrometer. The integration of characteristic ion signals in the retention time window of the analyte, from samples with undetectable analyte levels, allows building signal noise models from the mean and standard deviation of these peaks. The distribution of the blank signal was modelled from the mean and standard deviation of the peaks by taking simulated one-tailed t values of the Student’s t distribution, where no negative signals are allowed by truncating the distribution below zero. When independently simulated signal noises of two ions produce an AR within the acceptance limits for this ratio (e.g. P2.5 and P97.5), a false positive result from AR is reported. The proportion of simulated AR within the acceptance limits, given the total number of simulated AR, estimates the FP.
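The two simulation steps described above (correlated abundances following equations (1) and (2), then blank-signal ratios compared with the percentile limits) can be illustrated with a minimal Python sketch. It is not the spreadsheet of the Supporting Information and rests on several assumptions: both ions share the same number of degrees of freedom, so one t deviate serves both terms driven by R1; signed t deviates are drawn directly instead of using the MS-Excel TINV function; truncation below zero is implemented by rejecting non-positive draws; and all numerical inputs are hypothetical.

```python
import random


def t_deviate(dof, rng):
    """Student's t deviate with `dof` degrees of freedom, built as Z / sqrt(chi2 / dof)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return z / (chi2 / dof) ** 0.5


def simulate_ar(mean1, sd1, mean2, sd2, rho, dof, rng):
    """One simulated ratio A1/A2 with correlated abundances, after equations (1) and (2)."""
    t1 = t_deviate(dof, rng)
    t2 = t_deviate(dof, rng)
    a1 = mean1 + sd1 * t1
    a2 = mean2 + sd2 * (t1 * rho + t2 * (1.0 - rho ** 2) ** 0.5)
    return a1 / a2


def percentile(sorted_values, p):
    """Percentile (p in 0..100) by linear interpolation between closest ranks."""
    pos = (len(sorted_values) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def ar_limits(analyte, n, rng):
    """P2.5 and P97.5 of the simulated AR (limits for a 95 % confidence level)."""
    ratios = sorted(simulate_ar(*analyte, rng) for _ in range(n))
    return percentile(ratios, 2.5), percentile(ratios, 97.5)


def blank_abundance(mean, sd, dof, rng):
    """Blank signal of one ion; non-positive draws are rejected (assumed truncation)."""
    while True:
        value = mean + sd * t_deviate(dof, rng)
        if value > 0.0:
            return value


def fp_from_noise(blank1, blank2, limits, n, rng):
    """Fraction of independently simulated noise ratios falling within the AR limits."""
    low, high = limits
    hits = sum(
        low <= blank_abundance(*blank1, rng) / blank_abundance(*blank2, rng) <= high
        for _ in range(n)
    )
    return hits / n


# Hypothetical inputs: analyte = (mean1, sd1, mean2, sd2, rho, dof); blank = (mean, sd, dof)
rng = random.Random(1)
limits = ar_limits((4000.0, 300.0, 10000.0, 600.0, 0.9, 13), 20_000, rng)
fp = fp_from_noise((60.0, 25.0, 13), (80.0, 30.0, 13), limits, 20_000, rng)
print(limits, fp)
```

The same `simulate_ar` function, called twice per iteration and differenced, would give a simulated d(AR) instead of AR.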
Since the FP is high at extremely low analyte levels, the variation of the FP as the analyte level increases was studied. This assessment involved adding more constraints to the simulated AR. The simulated AR targeted at a specific analyte level (e.g. analyte concentration) is compared with the acceptance limits for the AR only if each A is larger than the minimum expected A value, $A_{\min}$, at that analyte level. Therefore, models of ion abundance based on the mean value, $\bar{A}$, and standard deviation, $s_A$, of abundances were built from replicate analyses of analyte solutions. The $A_{\min}$ is estimated by equation (3):

$$A_{\min} = \bar{A} - t(\nu_A, 0.01)\,s_A \tag{3}$$

where $t(\nu_A, 0.01)$ is the two-tailed t value of the Student’s t distribution for $\nu_A$ degrees of freedom and a confidence level of 99 %. For instance, if two ions, 1 and 2, are studied (i.e. combined in an AR), each simulated abundance, $A_1(i)$ and $A_2(i)$, is compared with the respective $A_{\min}$ and, if neither is smaller than its limit (i.e. $A_1(i) \geq A_{\min}(1)$ and $A_2(i) \geq A_{\min}(2)$), the AR (i.e. $A_1(i)/A_2(i)$) is compared with the confidence interval of AR. If the simulated $A_1(i)/A_2(i)$ is within the limits set for AR, a false positive result is reported, i.e. the wrong indication of analyte presence.

In this work, since identifications are based on d(AR) estimated from a calibrator and the sample, ion abundances from calibrators and blank samples are simulated, and the probability is determined of abundances from blanks being high enough (i.e. having values larger than the respective $A_{\min}$) and producing a d(AR) within the respective acceptance limits.

The FP from d(RRT) is not estimated from signal modelling. Instead, a more pragmatic approach is considered, based on the analysts’ worst-case deduction of the probability of an interferent producing a d(RRT) within the respective acceptance limits. From the experience of using the studied procedure in doping control, a worst-case FP of 1 % was considered. Alternatively, the FP from d(RRT) can be accurately estimated by counting the number of cases where a peak, not confirmed to be the analyte, presented a d(RRT) inside the acceptance limits. The experimental determination of the FP from d(RRT) requires that the mass spectrum is selective enough to confirm analyte presence at relevant levels, and involves the collection of a large number of tests.
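As an illustration of the constrained estimate of the FP from d(AR) described above, the following Python sketch gates each simulated blank on the minimum abundance of equation (3) before comparing d(AR) with its acceptance limits. It is a sketch under stated assumptions, not the authors' spreadsheet: d(AR) is taken as the sample ratio minus the calibrator ratio, the critical t value is supplied by the caller from a table, the calibrator statistics are also used for the analyte level under study, both ions share one number of degrees of freedom, and every number in the usage lines is hypothetical.

```python
import random


def t_deviate(dof, rng):
    """Student's t deviate with `dof` degrees of freedom, built as Z / sqrt(chi2 / dof)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return z / (chi2 / dof) ** 0.5


def blank_abundance(mean, sd, dof, rng):
    """Blank signal of one ion; non-positive draws are rejected (assumed truncation)."""
    while True:
        value = mean + sd * t_deviate(dof, rng)
        if value > 0.0:
            return value


def fp_d_ar(cal, blank1, blank2, t_crit, d_limits, n=20_000, seed=1):
    """Estimated FP of identifications based on d(AR) with the A_min constraint.

    cal      = (mean1, sd1, mean2, sd2, rho, dof) of the calibrator abundances
    blank1/2 = (mean, sd, dof) of the blank signal of ions 1 and 2
    t_crit   = tabulated two-tailed t value for a 99 % confidence level
    d_limits = (lower, upper) acceptance limits for d(AR)
    """
    mean1, sd1, mean2, sd2, rho, dof = cal
    # Equation (3): minimum abundance expected for each ion
    a_min1 = mean1 - t_crit * sd1
    a_min2 = mean2 - t_crit * sd2
    d_low, d_high = d_limits
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # Calibrator abundances, after equations (1) and (2)
        t1, t2 = t_deviate(dof, rng), t_deviate(dof, rng)
        c1 = mean1 + sd1 * t1
        c2 = mean2 + sd2 * (t1 * rho + t2 * (1.0 - rho ** 2) ** 0.5)
        # Independent blank abundances of both ions
        b1 = blank_abundance(*blank1, rng)
        b2 = blank_abundance(*blank2, rng)
        # A false positive needs large enough abundances AND an accepted d(AR)
        if b1 >= a_min1 and b2 >= a_min2 and d_low <= b1 / b2 - c1 / c2 <= d_high:
            hits += 1
    return hits / n


# Hypothetical calibrator and blank models; 3.012 is the tabulated two-tailed
# 99 % t value for 13 degrees of freedom.
cal = (4000.0, 300.0, 10000.0, 600.0, 0.9, 13)
print(fp_d_ar(cal, (60.0, 25.0, 13), (80.0, 30.0, 13), 3.012, (-0.05, 0.05)))
```

With blanks this far below the minimum abundances, essentially no simulated blank passes the gate, which is the behaviour the text describes for levels well above the noise.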


Ferrara et al.⁹ discussed that, for an identification associated with a FP of 1 %, the minimum number of blank tests required to observe one or more false responses is 299 for a confidence level of 95 %. The limitation of the experimental determination of the FP can be mitigated by combining data from different analytes, assuming that the FP are equivalent for all analytes.

Likelihood ratio

The TP and FP estimated for analyte identification at a specific level, based on the d(RRT) or d(AR), are combined in the likelihood ratio, LR (LR = TP/FP), which expresses identification quality. According to criteria proposed by the UK’s Association of Forensic Science Providers¹⁰, when the LR is between 10³ and 10⁴, between 10⁴ and 10⁶, or larger than 10⁶, the evidence that supports the identification is considered ‘Strong’, ‘Very Strong’ or ‘Extremely Strong’, respectively. Therefore, given the impact of doping analysis on relevant interests, identifications should be associated with a LR larger than 10⁶. A LR of 10⁶ associated with a positive result denotes that the positive result is 10⁶ times more likely to be true than false.

When analyte identification is based on both the observed d(RRT) and d(AR), the LR of identifications is estimated by multiplying the LR associated with d(RRT) and d(AR). For instance, if the acceptance criteria for the d(RRT) and d(AR) are defined for a 95 % confidence level, which represents the TP, and the FP from d(RRT) and d(AR) are 1 % and 0.001 %, respectively, the LR of the identification becomes 9.03 × 10⁶ (9.03 × 10⁶ = (95/1)(95/0.001)). In this case, identifications based on d(RRT) or d(AR) alone are not conclusive, since they are associated with LR of 95 and 95000, respectively, but the combination of these two independent pieces of evidence constitutes an extremely strong support of analyte presence. If the tested urine has the same probability of having or not having the analyzed substance, the LR can be converted into the probability of the positive result being true, P, by LR/(LR+1).
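Two of the numerical statements above can be checked with a short Python sketch: the minimum number of blank tests, assuming independent tests so that the probability of at least one false response in n blanks is 1 − (1 − FP)ⁿ, and the combination of the TP and FP of independent identification parameters into a LR and a probability P. The function names are illustrative only.

```python
import math


def min_blank_tests(fp, confidence):
    """Smallest n for which P(at least one false response in n blanks) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fp))


def combined_lr(rates):
    """Product of TP/FP over independent identification parameters."""
    lr = 1.0
    for tp, fp in rates:
        lr *= tp / fp
    return lr


def probability_true(lr):
    """P = LR / (LR + 1), valid when presence and absence are equally likely a priori."""
    return lr / (lr + 1.0)


# TP and FP in percent for d(RRT) and d(AR), as in the example of the text
lr = combined_lr([(95.0, 1.0), (95.0, 0.001)])
print(min_blank_tests(0.01, 0.95))  # 299
print(lr, probability_true(lr))
```

Because TP and FP enter only as a ratio, they can be given as percentages or as fractions, provided both use the same unit.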
For instance, a positive result associated with a LR of 9.03 × 10⁶ is associated with a probability of being correct of 99.99999 % (five nines after the decimal place). Bayes’ theorem is the mathematical framework for the determination of P from the LR⁶,¹¹⁻¹³.

Quality of identifications based on WADA’s criteria

WADA defines that the presence of a compound in a urine sample is confirmed if the differences between the RT or RRT, and the AR, of the analyte observed in a calibrator (Positive Control) and of a sample peak are smaller than defined maximum limits. Both the calibrator and the sample should be injected under repeatability conditions (i.e. in the same day/analytical run). If the RT, $t_\mathrm{r}$, is used in the identification, the absolute value of d(RT), $|d(t_\mathrm{r})|$, should be smaller than the retention time of the analyte in the calibrator, $t_\mathrm{r}(\mathrm{C})$, times 0.1 (i.e. $0.1\,t_\mathrm{r}(\mathrm{C})$) or 0.1 min (equation 4), “whichever is greater, but not exceeding the full-width-at-half-maximum”⁴. The $|d(t_\mathrm{r})|$, ΔRT in the notation used by WADA⁴, is determined by $|t_\mathrm{r}(\mathrm{S}) - t_\mathrm{r}(\mathrm{C})|$, where $t_\mathrm{r}(\mathrm{S})$ is the retention time of the sample peak.

$$|d(t_\mathrm{r})| = |t_\mathrm{r}(\mathrm{S}) - t_\mathrm{r}(\mathrm{C})| \leq 0.1\,t_\mathrm{r}(\mathrm{C})\ \text{or}\ 0.1\ \text{min} \tag{4}$$

If, instead, the RRT, $t_\mathrm{Rr}$ ($t_\mathrm{Rr} = t_\mathrm{r}/t_\mathrm{r}(\mathrm{IS})$, where $t_\mathrm{r}(\mathrm{IS})$ is the retention time of the internal standard), is considered, referenced to the isotope-labelled analyte or to another compound as internal standard, the absolute value of the difference between RRT, $|d(t_\mathrm{Rr})|$, observed in the sample and the calibrator should be smaller than 0.5 % or 1 % of the RRT of the analyte in the calibrator, $t_\mathrm{Rr}(\mathrm{C})$ (equation 5). The calibrator and the sample should be injected under repeatability conditions.

$$|d(t_\mathrm{Rr})| = |t_\mathrm{Rr}(\mathrm{S}) - t_\mathrm{Rr}(\mathrm{C})| \leq 0.005\,t_\mathrm{Rr}(\mathrm{C})\ \text{or}\ 0.01\,t_\mathrm{Rr}(\mathrm{C}) \tag{5}$$
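The check of equation (5) is simple enough to express directly. The Python sketch below computes the RRT from a pair of retention times and tests the difference between sample and calibrator against a relative limit passed by the caller (0.005 or 0.01, following the text). The function names and the retention times in the usage lines are hypothetical.

```python
def relative_retention_time(t_analyte, t_internal_standard):
    """RRT: retention time of the analyte divided by that of the internal standard."""
    return t_analyte / t_internal_standard


def rrt_criterion_met(rrt_sample, rrt_calibrator, rel_limit):
    """Equation (5): |RRT(S) - RRT(C)| <= rel_limit * RRT(C)."""
    return abs(rrt_sample - rrt_calibrator) <= rel_limit * rrt_calibrator


# Hypothetical retention times in minutes (analyte, internal standard)
rrt_c = relative_retention_time(12.50, 12.40)  # calibrator
rrt_s = relative_retention_time(12.53, 12.41)  # sample
print(rrt_criterion_met(rrt_s, rrt_c, 0.005))
```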

The criteria proposed by WADA for identifying compounds from their mass spectrum obtained after a multi-stage fragmentation involve processing the signals of at least two precursor-product ion transitions. The signal of each ion should be larger than three times the signal noise (i.e. a signal-to-noise ratio larger than three), the AR should be determined by dividing the abundance of the less abundant ion by the abundance of the more abundant ion, and the absolute value of the d(AR) of the analyte observed in the sample and the calibrator should be smaller than the limits set in Table 1. The selectivity of the monitored ion transitions increases with the mass/charge ratio of the precursor and product ions¹⁴.


Since the identification criteria defined by WADA are general, i.e. not based on experimental data of identification performance, it is useful to determine the TP and FP of identifications based on these criteria. This information can help to understand if the proposed criteria adequately manage the risk of false identification and how occasional weaknesses in the criteria can be mitigated.

EXPERIMENTAL

The developed methodologies for setting statistically sound criteria for 𝑑(𝑡Rr) and 𝑑(𝐴𝑅), for estimating the FP from blank signal models and for assessing the performance of the identification criteria defined by WADA were used to study the identification of various anabolic steroids, diuretics and masking agents, stimulants and a cannabinoid in urine samples by GC-MS/MS. The performance of identifications was studied at the Minimum Required Performance Level (MRPL) or Threshold (T) defined by WADA¹⁵,¹⁶, denoted L, and at one-fourth, half and two times this level (i.e. L/4, L/2 and 2L). Table 2 presents the list of studied compounds and the respective MRPL or T. The regulation of nandrolone control has changed, and the identification of lower 19-NE levels is currently required¹⁷,¹⁸,¹⁹. The Supporting Information lists the chemicals and materials used and describes the analytical procedure.

Validation procedure

The collection of data for modelling d(RRT) and d(AR) involved repeating the injection of the following six solutions in the GC-MS/MS on 14 different days: the calibrator, as ‘blank’ urine X spiked at the MRPL or T (L), and urine Y unspiked and spiked at L/4, L/2, L or 2L. Other urine samples were injected between the calibrator and the unspiked and spiked urine Y to simulate a large sample batch. The same urine X was used to prepare the calibrator but 14 different urines Y were considered in each analysis day.
Independent assessment of the procedure

The defined acceptance limits (statistical and WADA’s limits) were tested through the analysis of five different urine samples, spiked at various analyte levels and independent of the urine samples used for setting the limits. The ‘blank’ urines were spiked at L/4, L/2, L and 2L. Since 12 analytes were tested in 20 samples (5 samples times four spiking levels), 240 checks of each limit type were performed. These assays were used to check the estimated TP and FN. It is not feasible to test the FP of highly selective determinations by direct experimentation, i.e. without modelling the available experimental data.

RESULTS AND DISCUSSION

Setting and performance of identification criteria

Table 3 and Table S2 of the Supporting Information present the most relevant performance parameters of analyte identification based on compound retention in the chromatographic system. They report the mean, 𝑡r, of retention times collected in different days, the standard deviation, 𝑠𝑡r, of the 𝑡r obtained under repeatability conditions (i.e. same day) and the Spearman’s correlation coefficient, ρ, between the 𝑡r of the analyte and the internal standard. These parameters are used to model 𝑡Rr and 𝑑(𝑡Rr). The precision was estimated under repeatability conditions since these are the relevant conditions for routine analysis, as calibrators (Positive Controls) and samples are injected in the same day. Also reported are the P2.5 and P97.5 for the statistical control of 𝑑(𝑡Rr) at a 95 % confidence level, the maximum absolute 𝑑(𝑡Rr), |𝑑(𝑡Rr)(Max)|, defined by WADA and the TP associated with WADA’s criterion. The confidence level of the statistical limits for 𝑑(𝑡Rr) corresponds to the TP of identifications. The P2.5, P50 and P97.5 of the 𝑡Rr simulation are also reported to allow observing the small asymmetry of this parameter (P50 − P2.5 ≠ P97.5 − P50). The retention time performance was assessed at four analyte levels, namely L/4, L/2, L and 2L.


The 𝑡r can vary more between days than predicted by 𝑠𝑡r due to, for instance, liner changing or column cutting. The 𝑠𝑡r was estimated from the mean, 𝑅, of the ranges of retention time duplicates (𝑠𝑡r = 𝑅/1.128). It can be observed from Table S2 that 𝑡r, 𝑠𝑡r and ρ vary with the analyte but do not vary significantly with the analyte level between L/4 and 2L. The correlation between the 𝑡r of the analyte and the internal standard is strong, as expected. Although the 𝑡Rr distribution can be slightly asymmetric, the 𝑑(𝑡Rr) are approximately symmetric since (-1)P2.5 ≅ P97.5. The statistical limits for 𝑑(𝑡Rr) at a confidence level of 95 % are stricter than the |𝑑(𝑡Rr)(Max)| proposed by WADA (1 % of 𝑡r). Therefore, the criteria defined by WADA lead to higher TP and lower FN; i.e. the WADA criteria should not miss an identification when retention times are checked. For simplicity, it was assumed that both the statistical and WADA’s criteria for 𝑑(𝑡Rr) are associated with a FP of 1 %, although WADA’s criteria filter out fewer retention time interferents. WADA relies mostly on the mass spectrometer to avoid false positive results. It can be considered that the wider tolerance for 𝑑(𝑡Rr) proposed by WADA allows comparing 𝑡r from larger differences in analyte concentration between the calibrator and the samples than those studied in this work.

Table 3 and Table S3 present the most relevant performance parameters and criteria for the identification of analytes from the AR of characteristic ions of the mass spectrum. They report the Spearman’s correlation coefficient, ρ, of ion abundances, the statistical control limits of 𝑑(𝐴𝑅), the blank signal variability limits as a percentage of the most abundant ion signal at the MRPL or T, the maximum absolute 𝑑(𝐴𝑅), |𝑑(𝐴𝑅)(Max)|, defined by WADA (Table S3), the FP of identifications based on the statistical criteria, and the TP and FP of identifications based on WADA’s criteria. Statistical limits were set at a 95 % confidence level.
It can be observed that there is a strong correlation between ion abundances and that the respective Spearman’s correlation coefficient does not vary significantly between L/4 and 2L. In many cases, the simulated 𝑑(𝐴𝑅) are not symmetric due to differences in the AR observed at the calibrator level (MRPL or T) and at other concentration levels and/or the fact that the AR are also asymmetric. The asymmetry of the AR increases with the dispersion of the abundance of the most abundant ion, positioned in the denominator of the AR⁶. The width of the statistical intervals is approximately constant or decreases as the concentration increases, with one exception for 5β-T. The analysis of 5β-T by GC-MS/MS is known to be problematic due to the low sensitivity to this compound. The limits are wider at lower concentrations because the relative dispersion of the signal increases at lower analyte levels. The blank signal is not expected to significantly affect determinations at the MRPL or T since the maximum P95 of the simulated signal is 15 % of the mean abundance of the most abundant ion, for the determination of 5β-T.

In most cases, the |𝑑(𝐴𝑅)(Max)| defined by WADA is stricter than the statistical limits, leading to FN larger than the 5 % associated with the statistical control (FN = 1 – TP). Nevertheless, for the analysis of Ep, Tr and Cr at MRPL and 2MRPL, and of C-THC and Cn between L/4 and 2L, the criteria defined by WADA are wider than the statistical ones, producing fewer FN. Figure S1 presents the variation of the FN and FP of analyte identifications with the concentration level, when 𝑑(𝐴𝑅) is controlled statistically or by using WADA’s criterion. The criteria defined by WADA can lead to high, and in some cases extremely high, FN. Only for the analysis of Ep, C-THC, Cn, Tr and Cr at L and 2L are the FN smaller than 5 % when WADA’s criteria are considered. For the identifications of C-THC and Cn the FN are below 5 % even at L/4 and L/2.
For identifications based on the statistical control of 𝑑(𝐴𝑅), the FP are only problematic for the identification of 19-NA, Am and Ot at L/4 (FP = 21 %, 7.5 % and 53 %, respectively) and Ot at MRPL/2 (FP = 9.5 %). Therefore, when the statistical criteria for 𝑑(𝐴𝑅) are considered, the FP is not a problem at and above L. When WADA’s criteria for 𝑑(𝐴𝑅) are considered, the FP are always negligible, even at L/4 and L/2.

Table S4 presents the calculations of the LR of identifications based on 𝑑(𝑡Rr) and/or 𝑑(𝐴𝑅) where statistical or WADA’s criteria are considered. The LR “Comb.” is applicable to cases where the identity is only confirmed when both the 𝑑(𝑡Rr) and the 𝑑(𝐴𝑅) estimated for the sample meet the respective criteria. Assuming that an identification is only conclusive if the LR is larger than 10⁶, the statistical control of identification parameters allows producing sound positive results at least at L and above that level. Only for six analytes at L/4 and/or L/2 is it not possible to provide conclusive identifications, due to higher rates of FP; a FP associated with 𝑑(𝐴𝑅) larger than 0.01 % produces a LR(“Comb”) lower than 10⁶. WADA’s criteria allow reporting sound positive results for all analytes except 19-NA and 19-NE, due to a combination of low TP and high FP. WADA’s criteria also do not allow sound identifications of 5α-T and Md below the MRPL.

Experimental assessment of identification criteria

Table S5 presents the results of the analysis of the spiked urine samples used to assess the statistical and WADA’s criteria for the identification of compounds. The table presents the estimated 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) of the analysis of each sample and identifies cases where the statistical (S) and/or WADA’s (W) criteria fail to confirm the presence of the analyte. The statistical criteria failed in 4.6 % of the cases, which is equivalent to the 5 % of failures expected for controls at a confidence level of 95 %. WADA’s criteria failed in 16 % of the cases, confirming the high FN determined by Monte Carlo Simulations. A closer analysis of Table S5 shows that the statistical control of 𝑑(𝑡Rr) never fails, while the control of 𝑑(𝐴𝑅) fails occasionally for an analyte and failed in the last run of checks of the 6β-H determination. These results are encouraging regarding the statistical control of the identifications. WADA’s criteria produced a large FN, but the failures are concentrated in the same analytes. WADA’s criteria for 𝑑(𝑡Rr) never failed since they are wider than the statistical ones.
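The statement above that 4.6 % of failures in 240 checks is equivalent to the expected 5 % can be illustrated with a simple binomial model. The Python sketch below assumes independent checks with a common failure probability of 5 %, and takes the number of observed failures as the integer implied by rounding the reported 4.6 %; both are assumptions made here for illustration, not values taken from Table S5.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))


n_checks = 240                          # 12 analytes x 20 spiked samples
observed = round(0.046 * n_checks)      # failures implied by the reported 4.6 %
expected = 0.05 * n_checks              # failures expected at a 95 % confidence level
sd = math.sqrt(n_checks * 0.05 * 0.95)  # binomial standard deviation of the count

print(observed, expected, round(sd, 2))
print(binomial_cdf(observed, n_checks, 0.05))
```

The observed count differs from the expected one by less than one binomial standard deviation, which is consistent with the agreement claimed in the text.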
For 5β-T, 6β-H and Md, the criteria for 𝑑(𝐴𝑅) failed in 12, 11 or 8 out of 20 checks (i.e. 60 %, 55 % or 40 % of the cases, respectively). These values are compatible with the FN estimated from modelling, considering that only 20 tests were performed.

The FN of identifications based on the regulated WADA’s criteria can be reduced if samples are injected more than once and the analyte presence is reported if, in at least one injection, both 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) meet WADA’s identification criteria. If a FN of 50 % is observed in single injections, duplicate injections would reduce the FN to 25 % (i.e. 50 % × 50 %). This methodology certainly also increases the FP but, since false positives are unlikely, that risk is not a problem.

The statistical assessment of WADA’s criteria allows identifying cases where the FN are too large. Ideally, WADA should also allow using alternative methodologies for the identification of compounds based on performance observed experimentally. Since the developed identification criteria were applied to a limited number of urines, they should be applied to more urines to assess if other interferents can affect identifications. The MS-Excel spreadsheet used to simulate 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) and to define limits for these parameters is made available as Supporting Information.

CONCLUSIONS

The developed methodology for determining the dispersion and correlation between the retention times of an analyte and an internal standard, or the abundances of two ions of the mass spectrum of a compound, and the subsequent Monte Carlo simulation of the differences, 𝑑(𝑡Rr), between relative retention times or, 𝑑(𝐴𝑅), between ion abundance ratios of the analyte observed in a calibrator and a sample, were successfully used in doping analysis. The simulation of 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) allowed defining acceptance limits for these parameters for a defined confidence level (i.e. 95 %), applicable to the identification of the analyte in unknown samples.
The confidence level of these limits defines the true positive results rate, TP. The simulation of sample signals with undetected levels of analyte (‘blank samples’) was successfully used to estimate the chance that signal noise produces ion abundances large enough to yield a 𝑑(𝐴𝑅) within its acceptance limits, which generates a false positive result. This simulation of the blank signal allowed estimating the false positive results rate, FP. The developed methodology for setting and assessing statistical criteria for compound identification by GC-MS was successfully applied to the analysis of 12 doping agents or their


metabolites in urine samples at the Minimum Required Performance Level or Threshold (L) defined by WADA, and below and above this limit: L/4, L/2 and 2L. The TP and FP of identifications based on 𝑑(𝑡Rr) (TP1 and FP1) and on 𝑑(𝐴𝑅) (TP2 and FP2) were combined in the likelihood ratio, LR, of identifications based on both identification parameters: LR = (TP1/FP1)·(TP2/FP2). Only identifications associated with an LR larger than 10⁶ are considered conclusive. The statistical control of 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) allows the conclusive identification of all the studied analytes at L or 2L. For half of the analytes, it is also possible to identify compounds at L/4 and L/2. Since WADA regulates the identification criteria for 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) regardless of the observed performance of identifications, these criteria were tested with simulated 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) values. Simulated 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) values for spiked urines that do not meet WADA’s criteria are false negative results, and those for ‘blank’ urines that meet the criteria are false positive results. WADA’s criteria were tested at four analyte levels: L/4, L/2, L and 2L. The assessment of WADA’s criteria allowed concluding that they are associated with a negligible FP, which in 19 % of the cases is smaller than the FP associated with the statistical control of identification parameters. However, the strict criteria for 𝑑(𝐴𝑅) defined by WADA can be associated with an extremely high FN (i.e. the probability of the analyte presence not being confirmed). This FN can be reduced by the replicate injection of the sample solution and by reporting a positive result if, in at least one injection, both WADA’s criteria for 𝑑(𝑡Rr) and 𝑑(𝐴𝑅) are met. In fact, WADA’s criterion for 𝑑(𝑡Rr) is not too strict; the criteria defined for 𝑑(𝐴𝑅) are responsible for the differences between the statistical and regulated controls. WADA’s criteria allow reporting the identification of all analytes with an LR larger than 10⁶ except in two cases: the metabolites of nandrolone.
Therefore, WADA’s criteria are not fit for the control of doping with this substance using the studied procedure. The performance of identifications based on statistical and WADA’s criteria quantified from modelling, namely the FN, was confirmed through the analysis of five urine samples spiked at four concentration levels (20 checks per analyte), independent of the samples used to set the statistical identification criteria. The statistical control of the analyte’s presence failed in 4.6 % of the cases, which is concordant with the confidence level of the limits (i.e. 95 %). WADA’s criteria failed in 16 % of the cases, with the FN reaching 60 %, 55 % and 40 % for the identification of 5β-T, 6β-H and Md, respectively. Since WADA’s criteria are the regulated ones, they must be used in official doping control. However, this work suggests that WADA should assess the need to define identification criteria from the performance observed experimentally in the laboratory. Meanwhile, laboratories are advised to perform a parallel identification of analytes from defined statistical limits in order to know how high the FN can be, to select which procedures should be improved and to select which samples should be tested again within 10 years with more selective procedures.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: (…). It includes a list of the acronyms and symbols used in the text, the description of the analytical procedure, and more data about identification criteria and performance. The Excel file used to define identification criteria is also provided.

ACKNOWLEDGEMENTS

This work was supported by Fundação para a Ciência e Tecnologia (FCT) under project UID/QUI/00100/2013 and SFRH/BPD/110186/2015.


Table 1. Criteria defined by WADA for the absolute values of 𝑑(𝐴𝑅) observed for the analyte present in the sample and calibrator (Positive Control). The AR is determined by the ratio between the abundances of the least and most abundant ions (or transitions).

AR (% of the base peak)    Identification criterion
50 to 100                  |𝐴𝑅(S) − 𝐴𝑅(C)| ≤ 10 %
25 to 50                   |𝐴𝑅(S) − 𝐴𝑅(C)| ≤ 0.2 · 𝐴𝑅(C)
1 to 25                    |𝐴𝑅(S) − 𝐴𝑅(C)| ≤ 5 %

𝐴𝑅(C) and 𝐴𝑅(S) are the abundance ratios of the analyte observed in the calibrator (Positive Control) and sample, respectively.

Table 2. List of studied compounds and respective Minimum Required Performance Level (MRPL) or Threshold (T) defined by WADA for doping control through urine analysis.

Analyte, acronym                               Class    MRPL or T (ng mL⁻¹)
19-Norandrosterone, 19-NA (a)                  AS       2 (T)
19-Noretiocholanolone, 19-NE (a)               AS       5
5β-Tetrahydromethyltestosterone, 5β-T (b)      AS       2
6β-Hydroxymethandienone, 6β-H (b)              AS       2
Epimetendiol, Ep (b)                           AS       2
Carboxy-tetrahydrocannabinol, C-THC (c)        C        150 (T)
Amiloride, Am                                  D&MA     200
Canrenone, Cn                                  D&MA     200
Triamterene, Tr                                D&MA     200
Carphedon, Cr                                  S        100
Modafinil, Md                                  S        100
Octopamine, Ot                                 S        1000

(a) metabolite of nandrolone; (b) metabolite of methandienone; (c) metabolite of tetrahydrocannabinol. (T) threshold value.
AS: anabolic steroid; D&MA: diuretic & masking agent; S: stimulant; C: cannabinoid.
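The tolerance windows of Table 1 can be written as a short Python check. This is a sketch under stated assumptions, not WADA's reference implementation: the function names are hypothetical, the window is selected from the abundance ratio of the calibrator, 𝐴𝑅(C), and ratios below 1 % are rejected because Table 1 does not cover them. The assignment of the boundary values 25 % and 50 % is immaterial, since adjacent rules give the same tolerance there (0.2 · 25 = 5 and 0.2 · 50 = 10).

```python
def wada_ar_tolerance(ar_calibrator):
    """Maximum allowed |AR(S) - AR(C)| in % of the base peak (Table 1).

    ar_calibrator is AR(C) in %. Selecting the window from AR(C) is an
    assumption of this sketch.
    """
    if 50.0 < ar_calibrator <= 100.0:
        return 10.0                  # absolute window of 10 %
    if 25.0 <= ar_calibrator <= 50.0:
        return 0.2 * ar_calibrator   # relative window of 20 % of AR(C)
    if 1.0 <= ar_calibrator < 25.0:
        return 5.0                   # absolute window of 5 %
    raise ValueError("AR(C) outside the 1-100 % range covered by Table 1")


def meets_wada_ar_criterion(ar_sample, ar_calibrator):
    """True if the sample abundance ratio falls within the Table 1 window."""
    d_ar = abs(ar_sample - ar_calibrator)
    return d_ar <= wada_ar_tolerance(ar_calibrator)


# Hypothetical abundance ratios, in % of the base peak.
print(meets_wada_ar_criterion(ar_sample=45.0, ar_calibrator=40.0))
print(meets_wada_ar_criterion(ar_sample=12.0, ar_calibrator=20.0))
```

In the first call the difference of 5 % is inside the 8 % window for 𝐴𝑅(C) = 40 %; in the second call the difference of 8 % exceeds the fixed 5 % window that applies to 𝐴𝑅(C) = 20 %.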


Table 3. Performance of compounds identification based on statistical or WADA’s criteria for 𝑑(𝑡Rr) or 𝑑(𝐴𝑅).

                     𝑑(𝑡Rr)    𝑑(𝐴𝑅)
                     WADA      Statistical   WADA      WADA
Analyte   Conc.      TP (%)    FP (%)        TP (%)    FP (%)
19-NA     T/4        100.00    21.1          57.9      0.012
19-NA     T/2        99.99     1.7           65.2      0.012
19-NA     T          100.00    0.001         66.6      0.012
19-NA     2T         100.00    0.001         77.3      0.008
19-NE     MRPL/4     99.97     0.001         82.5      0.017
19-NE     MRPL/2     99.95     0.001         74.4      0.013
19-NE     MRPL       99.96     0.001         91.7      0.010
19-NE     2MRPL      99.95     0.001         93.8      0.012
5β-T      MRPL/4     100.00    0.001         1.9       0.002
5β-T      MRPL/2     100.00    0.057         19.0      0.002
5β-T      MRPL       100.00    0.001         49.7      0.003
5β-T      2MRPL      100.00    0.001         12.4      0.001
6β-H      MRPL/4     99.42     0.001         25.7      0.001
6β-H      MRPL/2     99.24     0.703         50.2      0.001
6β-H      MRPL       99.33     0.001         82.6      0.001
6β-H      2MRPL      99.30     0.001         83.7      0.001
Ep        MRPL/4     100.00    0.001         80.7      0.001
Ep        MRPL/2     100.00    0.001         91.4      0.001
Ep        MRPL       100.00    0.001         99.1      0.001
Ep        2MRPL      100.00    0.001         99.1      0.001
C-THC     T/4        99.97     0.001         97.1      0.003
C-THC     T/2        99.96     0.001         98.2      0.005
C-THC     T          99.96     0.001         98.9      0.008
C-THC     2T         99.96     0.001         98.8      0.007
Am        MRPL/4     100.00    7.5           58.18     0.001
Am        MRPL/2     100.00    0.001         65.42     0.001
Am        MRPL       100.00    0.001         81.91     0.001
Am        2MRPL      100.00    0.001         86.23     0.001
Cn        MRPL/4     97.62     0.001         97.43     0.001
Cn        MRPL/2     97.30     0.001         96.84     0.001
Cn        MRPL       97.30     0.001         97.57     0.001
Cn        2MRPL      97.29     0.001         98.33     0.001
Tr        MRPL/4     100.00    0.001         92.36     0.001
Tr        MRPL/2     99.99     0.001         94.97     0.001
Tr        MRPL       99.99     0.001         95.70     0.001
Tr        2MRPL      100.00    0.001         98.11     0.001
Cr        MRPL/4     100.00    0.001         92.39     0.001
Cr        MRPL/2     100.00    0.001         94.07     0.001
Cr        MRPL       100.00    0.001         96.97     0.001
Cr        2MRPL      100.00    0.001         97.82     0.001
Md        MRPL/4     99.94     0.26          23.75     0.005
Md        MRPL/2     99.99     0.009         38.33     0.005
Md        MRPL       100.00    0.001         49.60     0.004
Md        2MRPL      100.00    0.001         48.01     0.001
Ot        MRPL/4     100.00    52.64         39.91     0.001
Ot        MRPL/2     100.00    9.53          50.08     0.001
Ot        MRPL       100.00    0.001         83.03     0.001
Ot        2MRPL      100.00    0.001         81.47     0.001
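The way TP and FP rates such as those of Table 3 feed the likelihood ratio, and the effect of replicate injections on FN and FP, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' calculation: the function names are hypothetical, the input rates are illustrative values (a TP of 95 % corresponds to the confidence level of the statistical limits, while the FP of 1 % used for 𝑑(𝑡Rr) is purely hypothetical), and the replicate-injection formula assumes that injections are independent.

```python
def likelihood_ratio(tp1, fp1, tp2, fp2):
    """LR = (TP1/FP1) * (TP2/FP2); all rates in the same unit (e.g. %)."""
    return (tp1 / fp1) * (tp2 / fp2)


def is_conclusive(lr, threshold=1e6):
    """An identification is taken as conclusive only if LR exceeds 10^6."""
    return lr > threshold


def replicate_rates(fn, fp, n_injections):
    """FN and FP (as fractions) when a positive result is reported if at
    least one of n independent injections meets both criteria."""
    fn_n = fn ** n_injections                  # every injection must fail
    fp_n = 1.0 - (1.0 - fp) ** n_injections    # any injection may pass
    return fn_n, fp_n


# Hypothetical rates in %: TP1 = TP2 = 95 and FP1 = 1 for d(tRr).
lr_low_fp = likelihood_ratio(95.0, 1.0, 95.0, 0.001)   # low FP for d(AR)
lr_high_fp = likelihood_ratio(95.0, 1.0, 95.0, 21.1)   # high FP for d(AR)
print(is_conclusive(lr_low_fp), is_conclusive(lr_high_fp))

# An FN of 50 % in single injections becomes 25 % with duplicate injections.
fn2, fp2 = replicate_rates(fn=0.50, fp=0.0001, n_injections=2)
print(fn2, fp2)
```

With these inputs the first likelihood ratio is conclusive and the second is not, which shows how a single identification parameter with a high FP can prevent a conclusive identification. The second function makes the trade-off of replicate injections explicit: the FN falls geometrically while the FP roughly doubles for duplicate injections when it is small.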
