Implications of Strain- and Species-Level Sequence Divergence for Community and Isolate Shotgun Proteomic Analysis

Vincent J. Denef,*,† Manesh B. Shah,§ Nathan C. VerBerkmoes,§ Robert L. Hettich,§ and Jillian F. Banfield*,†,‡

Department of Earth and Planetary Science and Department of Environmental Science, Policy and Management, University of California, Berkeley, California 94720, and Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831

Received February 22, 2007

The recent surge in microbial genomic sequencing, combined with the development of high-throughput liquid chromatography-mass-spectrometry-based (LC/LC-MS/MS) proteomics, has raised the question of the extent to which genomic information of one strain or environmental sample can be used to profile proteomes of related strains or samples. Even with decreasing sequencing costs, it remains impractical to obtain genomic sequence for every strain or sample analyzed. Here, we evaluate how shotgun proteomics is affected by amino acid divergence between the sample and the genomic database using a probability-based model and a random mutation simulation model constrained by experimental data. To assess the effects of nonrandom distribution of mutations, we also evaluated identification levels using in silico peptide data from sequenced isolates with average amino acid identities (AAI) varying between 76 and 98%. We compared the predictions to experimental protein identification levels for a sample that was evaluated using a database that included genomic information for the dominant organism and for a closely related variant (95% AAI). The range of models set the boundaries at which half of the proteins in a proteomic experiment can be identified to be 77-92% AAI between orthologs in the sample and database. Consistent with this prediction, experimental data indicated loss of half the identifiable proteins at 90% AAI. Additional analysis indicated a 6.4% reduction of the initial protein coverage per 1% amino acid divergence and total identification loss at 86% AAI. Consequently, shotgun proteomics is capable of cross-strain identifications but avoids most cross-species false positives.

Keywords: proteomics • strain variation • community genomics • metagenomics • liquid chromatography • mass spectrometry • Leptospirillum • modeling • sequence divergence • evolution

* Corresponding authors. E-mails: [email protected] (V.J.D.) and [email protected] (J.F.B.).
† Department of Earth and Planetary Science, University of California.
‡ Department of Environmental Science, Policy and Management, University of California.
§ Oak Ridge National Laboratory.

Journal of Proteome Research 2007, 6, 3152-3161. Published on Web 06/28/2007. 10.1021/pr0701005 CCC: $37.00 © 2007 American Chemical Society

Introduction

Currently there are ∼400 microbial genomes (http://img.jgi.doe.gov/) and several community and metagenomics datasets available.1-7 Post-genomic experiments to utilize this wealth of information are increasingly being performed to unravel complex questions about the behavior of organisms and communities. The two main experimental strategies for this are transcriptomics, using microarrays, and proteomics. The primary goal of proteomics is the identification of as many proteins as possible from a complex biological sample with high accuracy. In the 1990s, the coupling of 2D-PAGE with mass spectrometry (2D-PAGE-MS) was the dominant methodology for analyzing microbial proteomes.8 While 2D-PAGE initially dominated microbial work and was even implemented for analysis of microbial communities,9 recent technical developments in nanoscale HPLC and high-performance MS have altered this landscape. The coupling of liquid chromatography with tandem mass spectrometry (LC-MS/MS) through electrospray ionization has many advantages over traditional 2D-PAGE-MS.10,11 Multidimensional protein identification technologies based on LC/LC-MS/MS and automated database searching, coined MudPIT or shotgun proteomics,12 have enabled robust high-throughput proteomic analysis. With recent instrumental advances,13,14 2D-LC-MS/MS methods now allow the identification in a single experiment of up to half of the predicted proteins in isolates15-18 and organisms within communities.19,20

It is now relatively straightforward to obtain proteomic data from organisms for which genomic data sets are available. An important next step is to determine the extent to which these genomic data sets can be used for proteomics of organisms for which no sequence data exist. Even though sequencing costs are steadily decreasing, it would be impractical to have to acquire extensive genomic data from every sample or isolate analyzed. Although this issue has been raised before,21 no quantitative evaluation of the effects of sequence variation on protein identification levels has been performed thus far. Some ways to circumvent the confounding issue of amino acid substitutions on peptide/protein identification have been proposed. These include analysis of intact proteins (top-down approaches22) and de novo sequencing of peptides combined with matching to proteins in the current genomic databases (MS-BLAST21,23). At this time, the major drawbacks of both alternatives are their limited throughput and the lack of adequate computational tools for data mining/analysis.

Previously, we acquired genomic data from two related natural microbial biofilm communities involved in acid mine drainage formation.6,20 The dominant organisms in these communities promote pyrite (FeS2) dissolution by accelerating the rate of oxidation of ferrous iron.24-26 The biofilms are dominated by different Leptospirillum group II strains,6,27 which share an average amino acid identity (AAI) of 95% between their orthologs.20 With these genomic data as a basis, we have initiated proteomic analyses19 on a series of samples from the Richmond mine to determine the behavior of the community members and correlate these with community structure and environmental conditions. Recent work has highlighted the accuracy and sensitivity of current proteomics approaches, enabling strain-level resolution of protein identification, which led us to uncover evidence of recombination between the two Leptospirillum group II sequence types in a genomically uncharacterized biofilm.20

In this report, we modeled and quantified the effects of genome sequence divergence on the levels of peptide and protein identification. We developed a random substitution model that considers the AAI, peptide length distribution, the fraction of unique and detectable peptides (by our 2D-LC-MS/MS method), and protein length. To evaluate the effects of experimental constraints, we simulated random amino acid divergence in the database and tested the effects on the identification of the subset of peptides identified in an actual proteomics experiment. Because amino acid substitutions do not occur randomly, we analyzed theoretical peptide recovery levels using genomic data from closely related strains and species. Finally, we assessed the modeled predictions by quantifying the decline in protein identification as a function of amino acid identity using an experimental proteomic data set. Our results can be used as guidelines for determining the feasibility of the shotgun proteomics approach for analysis of unsequenced isolates and communities based on their similarity to organisms that have been sequenced.

Materials and Methods

Sampling and Community Sequence Data. This study made use of proteomic and genomic data collected from acid mine drainage biofilms from the Richmond Mine (Iron Mountain, Redding, CA). Genomic data sets were generated from the 5-way site in March 2002 and UBA site in June 2005, as previously described.6,20 Both biofilms were dominated by Leptospirillum group II (∼75%) and contained lower amounts of Leptospirillum group III, different archaeal species, and other Bacteria and Eukarya. Community genomic library construction and sequencing methods for both community genomic data sets (5-way CG and UBA) have been previously described.6,20 Although the UBA genomic data set was used in the analysis of a biofilm from another site,20 this is the first analysis of the proteome of the UBA biofilm (see below) and the first case where proteomic and genomic data sets derive from the same environmental sample.

Isolate Sequence Data. Genome sequence data were derived from the integrated microbial genome resource (IMG, http://img.jgi.doe.gov) for Burkholderia cenocepacia HI2424 (reference genome 1), Burkholderia xenovorans LB400 (tester genome 1A),28 Burkholderia vietnamiensis G4 (tester genome 1B), Burkholderia pseudomallei (tester genome 1C),29 Burkholderia cepacia complex (Bcc) strain 383 (tester genome 1D), and Burkholderia ambifaria (tester genome 1E), as well as for Bacillus anthracis strain Ames (reference genome 2),30 Bacillus cereus E33L (tester genome 2A), and Bacillus cereus ATCC10987 (tester genome 2B).31 All non-published sequence data were produced by the U.S. Department of Energy Joint Genome Institute (http://www.jgi.doe.gov/). We determined the orthologs between each tester genome and its respective reference genome based on a reciprocal best blastp search32 and elimination of those protein pairs for which the ratio between the blastp score and the blastp score of the reference genome (B. anthracis or B. cenocepacia) self-blast was smaller than 0.4.30,31 Similarly, orthologs between the Leptospirillum group II proteins of the 5-way CG (reference genome 3) and UBA (tester genome 3) variant were determined.20 Query and search protein databases are summarized in Table 1.

Proteomics Data Set. The tandem mass spectra (MS/MS) analyzed in this manuscript were collected using the methods described earlier.19,20 Briefly, a complex proteome was extracted from a natural biofilm collected at the UBA location of the Richmond Mine, Iron Mountain, CA. One fraction of the biofilm was set aside for genomic analysis, and the remainder was frozen on dry ice on site for proteomics. An extracellular fraction was collected by washing 5 mL of compacted biofilm with 15 mL of H2SO4, pH 1.1, at 4 °C.
Cells were lysed by sonication (on ice) in 12 mL of 20 mM Tris-SO4 buffer, pH 8.0, followed by addition of 40 mL of 0.1 M Na2CO3, pH 11. After lysis, the remaining proteins were separated by (ultra)centrifugation into three fractions (membrane, soluble, and whole cell). Each individual fraction was denatured, reduced, and digested into peptides using sequencing grade trypsin (Promega, Madison, WI). Peptides were de-salted, concentrated, and analyzed via nano-2D-LC (SCX-RP)-MS/MS on a linear ion trap mass spectrometer (Thermo Finnigan, San Jose, CA), using three independent runs per fraction. As described previously in more detail,20 generated mass spectra were matched to peptides using SEQUEST33 and filtered and sorted with the DTASelect algorithm34 considering fully tryptic peptides only, with ∆CN of at least 0.08 and cross-correlation scores (Xcorr) of at least 1.8 (+1), 2.5 (+2), and 3.5 (+3) at the peptide level. The DTASelect results from all proteome fractions were compared with the Contrast program34 to obtain a final list of identified proteins. At least two peptides had to be identified within the same run in order for a protein to be deemed identified, which results in a false-positive rate of less than 1%.20 In this process, we also identified peptides shared by both Leptospirillum group II types as well as those unique to only one of the orthologs. The 5-wayCG + UBA database used for Sequest searches contained all proteins derived from the 5-way CG genomic data and all Leptospirillum group II and group III proteins from the UBA genomic data20 (Table 1). A thorough analysis of false-positive rates associated with our search algorithm was described earlier.20 Data used in this study were restricted to the identified peptides belonging to Leptospirillum group II and group III and were strictly used as a test set (Table 1), while the full proteome and biological significance will be published at a later date.

Table 1. Overview of Databases and Data Sets

database/data set | entries | description

1. Sequest Searches To Generate UBA Proteomics Data Set
5-wayCG + UBA database | 16,170 proteins | Protein database derived from 5-way CG6 and UBA community genomic data20
UBA proteomics data set | 4066 proteins | Proteins (and peptides) identified from the UBA sample by community proteomics

2. Simulation Model
3. Comparison of Database and Data Set Length, MW, pI, Hydrophobicity Distributions
UBA_LeptoII + LeptoIII database | 5785 proteins | 5-wayCG + UBA database proteins encoded by Leptospirillum group II and group III populations present in the UBA sample
UBA_LeptoII + LeptoIII proteomics data set | 2378 proteins/25083 peptides | Subset of proteins/peptides from the UBA proteomics data set belonging to Leptospirillum group II (UBA type) and group III

4. Model Validation by in Silico Digestion of Tester Proteomes
Burkholderia cenocepacia HI2424 | 7046 proteins | Reference genome 1
Burkholderia xenovorans LB400 | 3507 proteins | Tester genome 1A proteins with orthologs in reference genome 1
Burkholderia vietnamiensis G4 | 4401 proteins | Tester genome 1B proteins with orthologs in reference genome 1
Burkholderia pseudomallei K96243 | 3627 proteins | Tester genome 1C proteins with orthologs in reference genome 1
Burkholderia cepacia complex (Bcc) strain 383 | 5184 proteins | Tester genome 1D proteins with orthologs in reference genome 1
Burkholderia ambifaria MC40-6 | 4987 proteins | Tester genome 1E proteins with orthologs in reference genome 1
Bacillus anthracis Ames | 5311 proteins | Reference genome 2
Bacillus cereus E33L | 4451 proteins | Tester genome 2A proteins with orthologs in reference genome 2
Bacillus cereus ATCC10987 | 4275 proteins | Tester genome 2B proteins with orthologs in reference genome 2
Leptospirillum group II (5-way CG type) | 2846 proteins | Reference genome 3
Leptospirillum group II (UBA type) | 2157 proteins | Tester genome 3 proteins with orthologs in reference genome 3

5. Experimental Validation
UBA_LeptoII proteomics data set | 1264 proteins/15747 peptides | Subset of proteins/peptides from the UBA proteomics data set belonging to Leptospirillum group II (UBA type)

Probability Model. The probability that the amino acid sequence of a peptide is altered (and thus no longer identified) by one or more mutations was calculated using the following formula:

$$P_{pep} = \sum_{j=6}^{56} \left(1 - (AAI/100)^{j}\right) \frac{n_j}{N_T}$$

AAI is the average level of amino acid identity (%) between the query proteins and their orthologs in the database; j is the peptide length, constrained from 6 to 56 because these lengths correspond approximately to the MW cutoffs (700-6000 Da) for detection by tandem mass spectrometry; n_j is the number of peptides in the UBA_LeptoII + LeptoIII database (data set) with length j, and N_T is the total number of peptides in the UBA_LeptoII + LeptoIII database (data set) between 6 and 56 amino acids in length. In this way, Ppep was calculated as a function of peptide length and the length distribution in either the database or data set. Using the Ppep calculated with either the UBA_LeptoII + LeptoIII database or the data set peptide length distribution, we calculated the probability for a protein to be identified (Pprot) as a function of amino acid divergence using the formula

$$P_{prot} = 1 - (P_{pep})^{N_p} - (P_{pep})^{N_p-1}\,(1 - P_{pep})\,\frac{N_p!}{1!\,(N_p-1)!}$$

Np is the average number of peptides in a protein, calculated by rounding Lprot/Lap to a number without decimals, where Lprot is the protein length and Lap is the average peptide length in the UBA_LeptoII + LeptoIII database or data set. The first subtracted term is the probability that all of a protein's peptides have undergone amino acid substitutions, while the second subtracted term is the probability that all but one of a protein's peptides were modified in their sequence. This could be simplified to

$$P_{prot} = 1 - (P_{pep})^{N_p} - (P_{pep})^{N_p-1}\,(1 - P_{pep})\,N_p$$
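The two formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy peptide length distribution are hypothetical, and real n_j values would have to come from the UBA_LeptoII + LeptoIII database or data set.

```python
def p_pep(aai, length_counts):
    """Probability that a peptide carries at least one substitution.

    aai           -- average amino acid identity in percent
    length_counts -- dict mapping peptide length j to the number of peptides n_j
    Only lengths 6..56 are used, as in the formula for Ppep.
    """
    identity = aai / 100.0
    usable = {j: n for j, n in length_counts.items() if 6 <= j <= 56}
    n_total = sum(usable.values())  # N_T
    return sum((1.0 - identity ** j) * n for j, n in usable.items()) / n_total


def n_peptides(protein_length, avg_peptide_length):
    """Np: Lprot / Lap rounded to a whole number."""
    return round(protein_length / avg_peptide_length)


def p_prot(p, n_pep):
    """Probability that at least two of n_pep peptides are unchanged (simplified formula)."""
    if n_pep < 2:
        return 0.0  # two matching peptides are required for an identification
    return 1.0 - p ** n_pep - n_pep * p ** (n_pep - 1) * (1.0 - p)


# Hypothetical length distribution, for illustration only.
toy_counts = {8: 300, 12: 400, 20: 200, 35: 100}
p = p_pep(95.0, toy_counts)
print(p, p_prot(p, n_peptides(300, 15)))
```

Because Pprot depends on Np, shorter proteins (fewer peptides) lose identification faster than longer ones at the same Ppep.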


So as not to consider nondetectable and non-unique peptides, a correction factor was subtracted from Pprot, calculated using the formula

$$\sum_{k=2}^{N_p} \left\{ (P_{pep})^{N_p-k}\,(1 - P_{pep})^{k}\,\frac{N_p!}{k!\,(N_p-k)!} \right\} \cdot \left\{ (1 - f_{du})^{k} + (1 - f_{du})^{k-1}\,f_{du}\,\frac{k!}{1!\,(k-1)!} \right\}$$

where k runs from 2 to Np and fdu is the fraction of detectable, unique peptides. The term between the first set of curly brackets calculates the probability for the protein to be identifiable using its ortholog by 2, 3, ..., Np peptides. The second term calculates the probability that all or all but one of the identifiable peptides are undetectable or non-unique. The fraction fdu was set at 0.40 based on the average sequence coverage of the identified Leptospirillum group II and group III proteins (∼40%). The correction factor formula could be simplified to

$$\sum_{k=2}^{N_p} \left\{ (P_{pep})^{N_p-k}\,(1 - P_{pep})^{k}\,\frac{N_p!}{k!\,(N_p-k)!} \right\} \cdot \left\{ (1 - f_{du})^{k} + (1 - f_{du})^{k-1}\,f_{du}\,k \right\}$$
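The correction factor can be sketched as follows. The helper names are hypothetical and the code is an illustration of the formula rather than the authors' implementation. As a cross-check that is not part of the original text, the corrected value equals the probability of finding at least two usable peptides when each peptide is independently unchanged and detectable/unique with probability (1 − Ppep)·fdu.

```python
from math import comb


def p_prot(p_pep, n_pep):
    """Uncorrected probability that at least two of n_pep peptides are unchanged."""
    if n_pep < 2:
        return 0.0
    return 1.0 - p_pep ** n_pep - n_pep * p_pep ** (n_pep - 1) * (1.0 - p_pep)


def correction(p_pep, n_pep, f_du=0.40):
    """Correction factor: exactly k peptides are unchanged (k >= 2), but fewer
    than two of those k are detectable and unique."""
    total = 0.0
    for k in range(2, n_pep + 1):
        # First curly-bracket term: exactly k of n_pep peptides unchanged.
        p_k_unchanged = comb(n_pep, k) * p_pep ** (n_pep - k) * (1.0 - p_pep) ** k
        # Second curly-bracket term: none or only one of the k is detectable/unique.
        p_too_few = (1.0 - f_du) ** k + k * f_du * (1.0 - f_du) ** (k - 1)
        total += p_k_unchanged * p_too_few
    return total


def p_prot_corrected(p_pep, n_pep, f_du=0.40):
    return p_prot(p_pep, n_pep) - correction(p_pep, n_pep, f_du)


# Example with made-up inputs: Ppep = 0.4, Np = 20, fdu = 0.40.
print(p_prot(0.4, 20), p_prot_corrected(0.4, 20))
```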

Model Validation by in Silico Digestion of Tester Proteomes Predicted from Isolate Genome Sequence. Proteins of each of the tester genomes with an ortholog in its corresponding reference genome were digested in silico, as if perfectly cleaved by trypsin (cleavage after every lysine (K) or arginine (R), unless followed by a proline (P)). The resulting peptides (between the detectable mass range of 700-6000 Da) were searched in the reference genome’s protein database. If two or more peptides matched the ortholog, the protein was deemed identified. Simulation by in Silico Evolution of Leptospirillum Protein Sequences. The peptides identified using the complete 5-wayCG + UBA database matched to Leptospirillum group II and Leptospirillum group III (Table 1, UBA_LeptoII + LeptoIII proteomics data set peptides) were used to query the UBA_LeptoII + LeptoIII database (Table 1) after it had been


Table 2. The UBA_LeptoII Proteomics Data Set Was Divided into Subclasses Based on Amino Acid Divergence between UBA and 5-way CG Protein Variants^a

subclass | I | II | III | IV | V | VI | VII | VIII | IX | X
δaa (%) |
Identified (UBA_LeptoII) |
Identified (5-wayCG_LeptoII) |

30, calculated based on the Kyte and Doolittle scale35) and, to a lesser extent, extreme isoelectric point values (12) had a decreased identification probability. We evaluated the effects of length, uniqueness, and hydrophobicity cutoffs on theoretical rates of identification of Leptospirillum group II peptides. In a database that assumed no missed cleavages, 52% of all peptides would not be identified based on MW cutoffs. Missed cleavages will, to a large extent, moderate this loss in peptide identification levels (Figure S1 in Supporting Information). Of the remaining identifiable peptides, ∼1% were non-unique and ∼2% did not pass the hydrophobicity cutoffs.
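The effect of the MW cutoffs on a perfectly digested proteome can be illustrated with a short sketch. It assumes ideal trypsin cleavage (after K or R, not before P, no missed cleavages) and monoisotopic residue masses; the source does not state which mass type was used, and the function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(seq):
    """Cleave after every K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        is_last = i == len(seq) - 1
        if aa in "KR" and not is_last and seq[i + 1] != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def detectable_peptides(seq, low=700.0, high=6000.0):
    """Tryptic peptides whose mass falls inside the detectable window."""
    return [p for p in tryptic_digest(seq) if low <= peptide_mass(p) <= high]


# Made-up sequence: the dipeptide "GK" is too light and is filtered out.
example = "GK" + "A" * 10 + "K"
print(tryptic_digest(example), detectable_peptides(example))
```

Applying `detectable_peptides` to every protein in a database and comparing the count with the total number of tryptic peptides gives the fraction lost to the MW cutoffs.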

Table 4. Summary of Modeling Predictions and the Evaluation Using the UBA_LeptoII Proteomics Data Set^a

model | Pprot at AAI = 95% | AAI at which Pprot = 50%
Probability model (Db) | 98% | 78%
Probability model (Db, corrected) | 83% | 87%
Probability model (Data) | 92% | 87%
Probability model (Data, corrected) | 65% | 93%
Random substitution simulation | 71% | 92%
In silico analysis of related genomes | 90% | 77%
UBA_LeptoII proteomics data | 82% | 89%

^a Db = UBA_LeptoII + LeptoIII database peptide length distribution; Data = UBA_LeptoII + LeptoIII data set peptide length distribution; corrected = after introduction of a correction factor for undetectable and non-unique peptides.

Because amino acid substitutions are not randomly distributed throughout a protein's sequence (as assumed in the probabilistic and simulation models), we evaluated the effect of substitutions on peptide identification by counting the peptides identifiable using both Leptospirillum group II genomes. Out of 37 571 predicted peptides for the UBA-type orthologs that fall within the detectable MW range, 24 262 could be matched to the 5-way CG type orthologs, corresponding to a Ppep = 0.35 (1 - 24 262/37 571). This is considerably better than any of the random substitution models at the same level of sequence divergence (5%, Figure 1). The reason for this observation is that any process that clusters substitutions (e.g., their partitioning away from functional sites) will decrease the number of peptides with modified masses as compared to a random substitution model.

Protein Identification Levels. Protein identification levels were calculated from Ppep while requiring at least two identified peptides per protein. With the use of the Ppep based on the peptide length distribution of the UBA_LeptoII + LeptoIII database (Figure 2A) or the UBA_LeptoII + LeptoIII proteomics data set (Figure 2B), the probabilistic model determined the protein identification levels as a function of amino acid divergence as well as protein length. The shorter the protein, the more sensitive it was to identification loss (because there were fewer peptides available to be identified). In line with the Ppep, protein identification levels declined faster with sequence divergence when using the peptide length distribution for the data set (Figure 2B) instead of the database (Figure 2A). Virtually all average-sized proteins (300 aa) could still be identified when using a database of proteins sharing 95% AAI to those in the actual sample.

The probabilistic model calculated the probabilities for a set of discrete protein lengths, while the models with experimental or biological constraints evaluated the overall protein identification loss without protein size discrimination. To allow direct comparison between all models, we used the length distribution of the Leptospirillum group II (UBA type) protein complement and the Pprot as calculated using the probability model (Table 3) to calculate the average protein identification levels as a function of amino acid divergence (Table 4). At 95% AAI, when using the UBA_LeptoII + LeptoIII database peptide length distribution and the probabilistic model, we predicted a protein identification rate of 98%. Half of the proteins were no longer identified at 78% AAI (Figure 3). When using the UBA_LeptoII + LeptoIII data set peptide length distribution, these numbers changed to 92% identification and an AAI of 87%.

In the in silico analysis of variation between related genomes, when only considering peptides within the mass range detectable by our methods (700-6000 Da), 90% of proteins could still be identified at 95% AAI. However, because amino acid substitutions are not randomly located, identification remained higher than was predicted by the probability approach at higher sequence divergence levels, as illustrated by the fact that half of the proteins were no longer identified at 77% AAI (Figure 3).

When the analysis of protein identification levels was performed using our simulation approach, which is constrained by experimental limitations to peptide identification but not by the biological nature of mutation distribution, protein identification was dramatically reduced as compared to the non-experimentally constrained models. At 95% AAI, we only retained 71% of the originally identified proteins, while half of the proteins were lost at an AAI of 92% (Figure 3). This reduction as compared to the other models was due to the same reason that shorter proteins were lost at a significantly higher rate than larger ones in the probability model (Figure 2), namely, a lower number of available peptides. Though most proteins have many peptides available, only a limited number of these are of use for protein identification.
Some peptides were nondetectable by our 2D-LC-MS/MS method (due to length, signal peptide cleavage, or hydrophobicity cutoffs (Figure S1A-D in Supporting Information)). Others were not identified due to low abundance of the protein, which lowers the probability that its peptides are picked up by the MS's shotgun sampling.
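A random substitution simulation of the kind discussed above can be sketched as follows. The details of the simulation are not recoverable from this text, so every choice here is an assumption: substitutions are drawn independently per residue at a rate of 1 − AAI/100, a peptide counts as lost if any residue within its span is substituted, and a protein is retained if at least two of its originally identified peptides remain intact. The function name and input layout are hypothetical.

```python
import random


def simulate_retention(proteins, aai, seed=0):
    """Fraction of originally identified proteins still identified after
    random substitution at the given AAI (percent).

    proteins -- dict: name -> (protein_length, [(start, end), ...]) where each
                half-open span marks a peptide identified in the experiment.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rate = 1.0 - aai / 100.0
    originally_identified = 0
    retained = 0
    for length, spans in proteins.values():
        if len(spans) < 2:
            continue  # not identified to begin with (two-peptide rule)
        originally_identified += 1
        mutated = [rng.random() < rate for _ in range(length)]
        intact = sum(1 for start, end in spans if not any(mutated[start:end]))
        if intact >= 2:
            retained += 1
    return retained / originally_identified if originally_identified else 0.0


# Made-up input: two proteins with a few identified peptide spans each.
toy = {
    "protein_1": (100, [(0, 10), (20, 32), (50, 58)]),
    "protein_2": (60, [(5, 15), (30, 40)]),
}
print(simulate_retention(toy, 95.0))
```

Because only identified peptides are tracked, a protein with few identified peptides is lost quickly, which mirrors the sensitivity of low-coverage proteins described in the text.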


Figure 3. Protein identification levels as a function of amino acid divergence, calculated based on the probability model using Ppep calculated based on the UBA_LeptoII + LeptoIII database (light blue, dotted line) or UBA_LeptoII + LeptoIII data set (light blue, solid line) peptide length distribution (Figure 1), and averaged using the Pprot per protein length size class (Figure 2) from modeling in combination with the Leptospirillum group II (UBA-type) protein length distribution; the same probability model, with Pprot corrected for the fraction of detectable, unique peptides (40%) (dark blue); the random substitution simulation (orange), incorporating effects of experimental constraints; in silico analyses of related (76-98% AAI) sequenced genomes (green), displaying effects of biological constraints; and the actual UBA_LeptoII proteomics data (black), showing the levels at which identified Leptospirillum group II UBA-type proteins were also identified as their Leptospirillum group II 5-way CG type ortholog. In addition, the overall identification level is displayed (orange dot), as well as the theoretical identification level in case no experimental constraints on peptide identification were present (green dot).

In the UBA_LeptoII + LeptoIII proteomics data set, the average protein sequence coverage was ∼40%. When the probabilistic approach was adjusted by introducing a correction factor reflecting the inability to identify 60% of peptides, protein identification decreased to levels similar to those in the simulation (Figures 2 and 3). At 95% AAI, when using the database peptide length distribution, 83% of proteins were still identified, while half of the proteins were no longer identified at an AAI of 87%. When using the data set peptide length distribution, these numbers became 65% identification at 95% AAI and a loss of half of the identified proteins at an AAI of 93%. The fact that sensitivity to protein loss is strongly dependent on coverage was also clearly demonstrated by plotting the number of peptides present per protein and the actual identified number of peptides per protein. When these distributions were plotted using the Leptospirillum group II UBA type and 5-way CG type orthologs (AAI = 95%), a shift occurred and the area under the curves below 2 peptides/protein represented the proteins no longer identified (Figure S2 in Supporting Information). This area was significantly larger when the identified peptides/protein curve was shifted than when the peptides present/protein curve was shifted. This explains the significant decrease in protein identification when constrained by experimental limitations to peptide identification.

Experimental Assessment. To assess the model predictions, the UBA_LeptoII proteomics data set was used. These data originate from the UBA biofilm sample for which matching genomic data were available. First, the UBA-type Leptospirillum group II protein identification levels were determined using the 5-way CG Leptospirillum group II genomic data set (AAI of 95%) assuming that all peptides within the 600-7000 Da range were detectable. The result, a protein identification level of 92%, was in line with the in silico analyses of Bacillus and Burkholderia genomes. We then used only the Leptospirillum group II peptides identified in the proteomics experiment on the UBA biofilm sample (UBA_LeptoII proteomics data set). As compared to querying the UBA Leptospirillum group II proteins, peptide identification was reduced to 69% (10 908 out of 15 747 peptides) and protein identification to 82% when querying the 5-way CG orthologs. This was significantly better than the result of the random substitution simulation. Some expressed proteins were encoded by genes that are uniquely present in one strain and cannot be identified when using genomic information from another strain or species. If these were included, protein identification dropped to 79%. Division of the data into subclasses (Table 2) and averaging of the data in each subclass to plot the data as a function of amino acid divergence (median of subclass) allowed a comparison with our modeling data (Figure 3). The actual protein identification loss trend was located within the solution space boundaries set by the best-case (theoretical probability and biological constraint models) and worst-case (simulation and corrected probability model) scenarios. At an AAI of 95%, we identified 82% of the originally identified proteins, while half of the proteins were no longer identified at an AAI of 89%.

Figure 4. Normalized loss of protein coverage (by identified peptides) as a function of amino acid divergence (δaa) and linear trend lines showing the correlation between δaa and normalized coverage loss. Data were grouped by rounding δaa to the first decimal. Average and standard error (= stdev/√n) per δaa group are given for all data (blue) and after exclusion of those proteins that were only identified using the protein variants 100% identical to the sample's protein complement (red).

For Leptospirillum group II proteins present in the UBA sample and identified using the 5-way CG protein variants, an average protein coverage of 28% was obtained, as compared to 44% when using the UBA variants. As protein identification is dependent on protein coverage and in particular on the number of peptides (see above), we plotted the loss in coverage for proteins identified using both orthologs as a function of the amino acid dissimilarity (Figure 4). When considering all data, the fitted correlation indicated a 7.4% coverage loss for every 1% amino acid divergence between the sample proteins and the database. This extrapolated to total identification loss at 11% sequence divergence. Because proteins with lower coverage are more susceptible to identification loss as the result of sequence divergence, we excluded the orthologs only identified using the UBA-type Leptospirillum group II genomic data from the analysis. This resulted in a loss of 5.4% of the coverage per 1% amino acid divergence, predicting total loss of coverage or identification at 16% sequence divergence. In both cases, the coverage loss was initially higher than the values based on the fitted correlation, which could again be caused by the loss of multiple peptides by one amino acid substitution due to missed cleavages. These two approaches set the worst-case and best-case limits, from which we could deduce a 6.4% coverage loss per 1% amino acid divergence and total identification loss at 14% amino acid divergence, which is in line with the data presented in Figure 3.

For a community sample, sequence coverage is in part dependent on the abundance of the organism. It is important to note, however, that only slightly lower average sequence coverage was achieved for the proteins of Leptospirillum group III (∼35%), an organism that was about 2 times less abundant in the biofilm than Leptospirillum group II. Moreover, since we normalized the coverage loss to the initial coverage when using the 100% identical proteins, the deduced coverage loss functions should be broadly applicable.
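The relation between the worst-case and best-case limits and the deduced values can be checked with simple arithmetic. The text does not state how the 6.4% and 14% figures were derived; the sketch below assumes they are midpoints of the two limits. The helper `normalized_loss` is a hypothetical illustration that further assumes a straight trend line reaching 100% normalized coverage loss at the stated divergence.

```python
def midpoint(a, b):
    return (a + b) / 2.0


def normalized_loss(divergence, slope, total_loss_at):
    """Normalized coverage loss (%) at a given amino acid divergence (%),
    assuming a linear trend with the given slope (% loss per 1% divergence)
    that reaches 100% loss at `total_loss_at`. Clipped to the 0-100 range."""
    loss = 100.0 - slope * (total_loss_at - divergence)
    return max(0.0, min(100.0, loss))


# Worst case: 7.4% per 1% divergence, total loss at 11% divergence.
# Best case:  5.4% per 1% divergence, total loss at 16% divergence.
slope_mid = midpoint(7.4, 5.4)         # close to the reported 6.4
total_loss_mid = midpoint(11.0, 16.0)  # 13.5, reported as 14 after rounding

print(slope_mid, total_loss_mid)
print(normalized_loss(5.0, 6.4, 14.0))  # loss at 95% AAI under these assumptions
```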

Conclusion

By using probability-based modeling, simulations, and the evaluation of experimental data, we quantified the influence of amino acid dissimilarity on protein identification levels when using the shotgun proteomics approach and determined underlying factors affecting these levels. The main conclusion from our analyses is that shotgun proteomics is capable of cross-strain identifications but avoids cross-matches with more distantly related organisms. While the microbial species definition is contentious, a separation based on