Revealing Unexplored Sequence-Function Space Using Sequence

Jul 27, 2018 - Here, we highlight the use of sequence similarity networks (SSNs) to identify previously unexplored sequence and function space...
From the Bench Cite This: Biochemistry XXXX, XXX, XXX−XXX

pubs.acs.org/biochemistry

Revealing Unexplored Sequence-Function Space Using Sequence Similarity Networks

Janine N. Copp,‡ Eyal Akiva,†,¶ Patricia C. Babbitt,†,¶ and Nobuhiko Tokuriki*,‡

‡ Michael Smith Laboratories, University of British Columbia, 2185 East Mall, Vancouver, British Columbia V6T 1Z4, Canada
† Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California 94158, United States
¶ Quantitative Biosciences Institute, University of California, San Francisco, California 94143, United States


ABSTRACT: The rapidly expanding number of protein sequences found in public databases can improve our understanding of how protein functions evolve. However, our current knowledge of protein function likely represents a small fraction of the diverse repertoire that exists in nature. Integrative computational methods can facilitate the discovery of new protein functions and enzymatic reactions through the observation and investigation of the complex sequence−structure−function relationships within protein superfamilies. Here, we highlight the use of sequence similarity networks (SSNs) to identify previously unexplored sequence and function space. We exemplify this approach using the nitroreductase (NTR) superfamily. We demonstrate that SSN investigations can provide a rapid and effective means to classify groups of proteins, therefore exposing experimentally unexplored sequences that may exhibit novel functionality. Integration of such approaches with systematic experimental characterization will expand our understanding of the functional diversity of enzymes and their associated physiological roles.

The expansive range of protein functions found in nature is a foundational platform for the evolution, adaption, and survival of organisms.1−5 The breadth of functional diversity and its associated consequences and implications are therefore critical for our understanding of modern biology. Researchers have discovered a wide range of protein and enzyme functions during the last century that have been characterized in biochemical and biological terms. However, the full repertoire of functional diversity that exists in nature likely remains concealed. The revolution in sequencing technology over the past decade has resulted in increasing amounts of sequence information. More than 100 M protein sequences have been deposited in public databases, and this number easily exceeds several billion if metagenomic sequence information is also considered; experimental characterization of all identified sequences is unfeasible. Therefore, the challenge now lies in how we incorporate this new sequence information to deduce and annotate function and, concurrently, how we can design experimental efforts to more effectively explore unknown sequence space.6,7

Current classification systems assign new sequences to large groups, i.e., a protein fold or superfamily (a cluster of protein sequences that share a common structural fold, active site, and/or mechanistic features8,9). However, at least one-third of protein superfamilies comprise a wide variety of functions;10 such superfamilies can contain >100 000 sequences and harbor >100 distinct functions. Several functionally diverse superfamilies (and smaller families) have been extensively explored by experimental scientists; therefore, we have a more comprehensive understanding of their functional repertoire.6,7,11−23 However, for most superfamilies, experimental characterization is sparse and it is extremely challenging to assign, annotate, and predict function. Conventional computational methods annotate new sequences based on the most homologous sequence that has been experimentally studied. However, such approaches are problematic when only a handful of functions have been characterized within the superfamily, or sequence set, of interest. Available annotations are inherently error prone,24,25 and misannotation is a critical issue that has important ramifications on interpretation; ultimately, these challenges limit our comprehension of functional diversity and evolution. Enhanced approaches to comprehensively characterize protein superfamilies and guide effective experimental exploration of their sequence space are an emerging need across diverse research fields, including biochemistry, enzymology, and evolutionary biology. Systematic investigation of functional diversity, in molecular and biological terms, will vastly improve our ability to harness and exploit this knowledge for both fundamental and applied sciences.26−28

Special Issue: Discovering New Tools Received: April 25, 2018 Revised: July 15, 2018


DOI: 10.1021/acs.biochem.8b00473 Biochemistry XXXX, XXX, XXX−XXX


enable hypotheses about how subgroupings of members within a superfamily, which typically share very low sequence identity,53 are related. Guided by these clues, structure-based comparisons and other approaches can then reveal remote homologies between distant sequence sets within the superfamily that have previously remained cryptic.56 A workflow for “in-house” generation and analysis of SSNs is outlined in the sections below and is additionally provided in step-by-step detail in the Supporting Information. Associated FASTA files and SSNs of the nitroreductase (PF00881) superfamily can be found on the Structure Function Linkage Database (SFLD): sfld.rbvi.ucsf.edu/django/superfamily/122/. For users with access to advanced computational expertise, a more comprehensive approach is outlined in Workflow for In-House SSN Generation. As an alternative, Web servers such as the Enzyme Function Initiative (EFI-EST) provide a streamlined framework for the generation of SSNs.57 An in-depth manually curated set of precomputed networks for several large and functionally diverse enzyme superfamilies is also available for download from the SFLD.58−60 Many of the latter include networks for specific subgroups and individual reaction families within these superfamilies as well. Comprehensive tutorials for the EFI-EST and the SFLD can be found online: efi.igb.illinois.edu/efi-est/tutorial.php and sfld.rbvi.ucsf.edu/django/web/tutorial_links/.

Here, we describe a computational approach to explore large protein superfamilies to capture a global picture of sequence diversity, concurrently revealing the proportion of sequence space that has been biochemically characterized. This approach provides a framework to assist functional assignment and, in particular, illuminates unexplored sequence space in order to expedite the discovery of new protein functions.29−35 We showcase this approach by exposing the unexplored sequence space in the nitroreductase (NTR) superfamily. The NTR superfamily comprises >24 000 sequences29 that are generally annotated as oxidoreductases. Experimentally characterized NTR enzymes typically use a ping-pong bi-bi redox reaction mechanism.36 To date, more than 10 distinct catalytic activities are observed within the superfamily including nitroaromatic, enone, quinone, and fatty acid reduction as well as dehalogenation and flavin fragmentation (Table 1).29

Table 1. Examples of Enzymes with Experimentally Verified Functions within the NTR Superfamily

activity/function                   EC classification     protein name/UniProtKB ID   reference
nitroreduction                      1.5.1.34, 1.6.99.x    NfsB/P38489                 41
nitroreduction                      1.5.1.34, 1.6.99.x    NfsA/P17117                 38
quinone reduction                   1.6.5.x, 1.6.99.x     DrgA/Q55233                 42
fatty acid oxidation                1.3.1.x, 1.6.99.x     ClaER/U6C5W9                43
thiazole biosynthesis               3.4.21.x              SagB/Q1J7H9                 44
catalase                            1.11.1.6              CinD/Q9CED0                 45
quinolone reduction                 1.6.5.x, 1.6.99.x     Frm2/P37261                 46
iodotyrosine dehalogenation         1.21.1.x              Iyd/Q9VTE7                  47
flavin fragmentation                1.13.11.x             BluB/Q92PC8                 48
diketopiperazine dehydrogenation    1.3.3.x               AlbA/Q8GED9                 49
F420 biosynthesis                   6.3.2.x               FbiB/P9WP79                 50
malonic semialdehyde reduction      1.1.1.x, 1.6.99.x     RutE/P75894                 51
FMN reduction                       1.5.1.x, 1.6.99.x     TdsD/Q9FAE6                 52



WORKFLOW FOR IN-HOUSE SSN GENERATION (see also SUPPORTING INFORMATION)

Creating a Protein Database: Downloading a Sequence Data Set. The Pfam database (http://pfam.xfam.org/) is a large collection of protein families defined by structural regions or “domains”.61 Pfam 31.0 (released March 2017) contains 16 712 families and 604 clans (higher level groupings, related by similarity of sequence, structure, or Hidden Markov Model (HMM)). Thus, select the Pfam family (or clan) that you wish to investigate and download the sequence data set from UniProtKB (www.uniprot.org). For example, in UniProtKB, the Pfam identifier PF00881 (relating to the nitroreductase superfamily) is associated with 84 584 sequences (UniProt release 2018_05).

Creating a Protein Database: Reducing the Sequence Data Set. Due to computational limitations, data sets comprising a large number of sequences (e.g., >10 000) must be condensed in order to enable the visualization and manipulation of SSNs on standard computers typically used by individual investigators. The Cluster Database at High Identity with Tolerance (CD-HIT) Web server,62 http://weizhongli-lab.org/cdhit_suite/cgi-bin/index.cgi, is an effective means to select representative sequences from a group of sequences that share a certain percent identity cutoff; e.g., a single sequence can be selected to represent a set of sequences sharing a particular sequence identity. Performing this process with the PF00881 data set using a 50% sequence identity cutoff reduces the data set from 84 584 sequences to 6211 representative sequences. A lower percent identity cutoff would result in fewer representatives, and a higher identity cutoff will result in more representatives.

Finalizing the Database. The Galaxy web-based tool (www.usegalaxy.org) can be used to eliminate small fragments (partial gene sequences) and very large multidomain proteins from a data set.
For example, for the reduced data set of 6211 PF00881 sequences, selecting the minimal length to be 100aa and the maximum length to be 1000aa removes 196 sequences,
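The same length filter can also be applied outside Galaxy. The following is a minimal Python sketch, assuming the representative sequences are held as FASTA-formatted text; the function names `read_fasta`, `filter_by_length`, and `to_fasta` are illustrative and are not part of Galaxy or of any tool named in the text.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_by_length(records, min_len=100, max_len=1000):
    """Keep sequences whose length lies within [min_len, max_len] residues."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


def to_fasta(records, width=60):
    """Serialize (header, sequence) pairs back to FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

With the default bounds, fragments shorter than 100 aa and proteins longer than 1000 aa are discarded, mirroring the limits quoted in the example above.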

Experimental investigations of the NTR superfamily, however, have largely focused on a limited set of enzymes, in particular, “nitroreductase” enzymes that reduce the nitro moiety of various nitroaromatic compounds relevant for biotechnological applications (e.g., cancer gene therapy,37,38 cell development,39 and bioremediation40). Consequently, much of the sequence and functional space of the superfamily has not been experimentally investigated and it is likely that numerous functions are yet to be discovered.



METHODS AND RESULTS

Sequence similarity networks (SSNs) are an effective way to visualize and analyze the similarity relationships between members of a superfamily.20,53−55 SSNs facilitate the inspection of very large sets of protein sequences and enable the simultaneous assessment of orthogonal information, e.g., functional diversity, when mapped onto the context of sequence similarity.53 This approach facilitates the characterization of much larger data sets compared to the conventional multiple sequence alignment (MSA) and phylogenetic tree analyses. The multiplicity of similarity connections that SSNs provide can


the network into constituent clusters that are disconnected from each other.

Visualizing Networks Using Cytoscape. The Cytoscape platform (http://www.cytoscape.org/)66 enables the visualization and manipulation of SSNs. Detailed tutorials are available at github.com/cytoscape/cytoscape-tutorials/wiki. In Cytoscape, a “node” (circle) represents a protein sequence and an “edge” (the line connecting nodes) is shown if the BLAST E-value meets the set threshold; e.g., in Figure 1a, two nodes will be connected by a line if the BLAST E-value (of the respective sequences that they represent) meets the E-value threshold of 1 × 10−22. In Cytoscape v3.6.1, the “prefuse force-directed (none)” layout algorithm is an effective way to graphically cluster SSNs. Computational limitations constrain the number of edges that can be displayed in an SSN; e.g., networks with >1 M edges cannot typically be manipulated on a computer with 8 GB RAM. Representative sequence sets facilitate the construction of networks that are appropriate for the computational resources that are typically available within a laboratory. At more stringent edge thresholds (e.g., the PF00881 data set with a 1 × 10−40 threshold), the smaller number of edges enables the visualization of a larger set of sequences; therefore, a higher CD-HIT cutoff such as 70% or 90% may be used. In our experience, generating an SSN that displays 50% ID. Edges represent an E-value threshold of 1 × 10−22. Sequences that are associated with a PDB structure are depicted as enlarged triangles, and nodes with characterized biochemical or physiological functions are colored and labeled. Examples of unexplored sequence space (i.e., clusters lacking functionally characterized members) are indicated with asterisks. Clusters containing fewer than 5 sequences have been removed for clarity. (B) A network is shown as per part A, but colored by sequence length.
(C) An SSN of the NTR superfamily is shown, displaying 5207 nodes colored and annotated as per part A. The edge inclusion threshold is 1 × 10−40.
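The node and edge rule described above (two nodes are joined when the BLAST E-value of the sequences they represent meets the threshold) can be sketched in a few lines of Python. The sketch assumes that the all-vs-all BLAST results are available in the standard 12-column tabular format (`-outfmt 6`), in which columns 1, 2, and 11 hold the query ID, subject ID, and E-value; the function names and the choice to keep the lowest E-value per sequence pair are illustrative assumptions, not the procedure used to produce Figure 1.

```python
from collections import defaultdict


def build_edges(blast_lines, evalue_threshold=1e-22):
    """Return {(id_a, id_b): evalue} for pairs meeting the E-value threshold.

    Self-hits are ignored, and when a pair is reported more than once
    (e.g., in both directions) the lowest E-value is kept.
    """
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue
        pair = tuple(sorted((query, subject)))
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    return {pair: e for pair, e in best.items() if e <= evalue_threshold}


def connected_components(nodes, edges):
    """Group nodes into clusters (connected components), largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for node in nodes:
        if node in seen:
            continue
        seen.add(node)
        stack, members = [node], set()
        while stack:
            current = stack.pop()
            members.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(members)
    return sorted(clusters, key=len, reverse=True)
```

The resulting edge dictionary can be written out as a two-column table for import into Cytoscape, and the components correspond to the disconnected clusters seen after layout.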

thresholds, e.g., 1 × 10−30, 1 × 10−40, and 1 × 10−50, allows the visualization of clustering robustness, as small isolated clusters will emerge; this is observed in Figure 1c (viewed at an E-value threshold of 1 × 10−40; colored as per Figure 1a). Defining Appropriate SSN Thresholds. Defining the most appropriate threshold to visualize and analyze a protein

sparse. Using approaches similar to those described for generating SSNs, structure similarity networks can also be computed to aid in developing hypotheses regarding especially divergent relationships.6,29 Repeating the BLAST calculation (see Performing In-House “All vs All” BLAST) with the same data set at iteratively higher


Table 3. Summary of Unknown Subgroups from the NTR Superfamily

subgroup   no. of seq   avg length   avg % ID (a)   PDB IDs
unk1       1769         191          38             3BM1, 3K6H
unk2       827          187          37             3J62, 3PXV, 3E3K
unk3       789          192          35             2I7H
unk4       623          221          38
unk5       533          256          35
unk6       287          196          40             2RO1
unk7       135          342          30
unk8       129          203          50
unk9       71           167          44
unk10      59           204          43
unk11      14           181          37

taxonomic profiling (% representation) (b, c)

bacteria Bdt

Str

Pro

1

90 6 2 1 1 20 35

18 17 15 4 14 21

Frm 57 67 70 36 9

5 48

1 10

88 28

Act

Oth

3 2 1 73 14 5 24 93 3 2

8 4 1 5 21 7 1 1 7

Ar

Eu

5

1 3

2

1

1

2 100

ND 6 3 6 10 3 4 3 1 7 3

(a) Average percent ID of subgroup members. (b) Taxonomical frequencies are based on UniProtKB/NCBI data retrieved for each subgroup member. (c) Abbreviations: Ar (Archaea); Eu (Eukaryota); Bdt (Bacteroidetes); Str (Streptomycetales); Pro (Proteobacteria); Frm (Firmicutes); Act (Actinobacteria); Oth (other). “ND” refers to sequences typically originating from metagenomic surveys.

subgroup at thresholds more stringent than 1 × 10−22 (dark blue nodes in Figure 1a,c). Although sequence similarity does not necessarily mean conserved function (see The Challenge of Analyzing Unknown Sequence Clusters), sequence similarity relationships are a powerful means to guide hypotheses of functional diversity and divergence when combined with other characteristics, such as the conservation of key catalytic residues. Our analysis of the NTR superfamily resulted in 22 sequence clusters.29 After careful curation of the literature, we established that 14 clusters included at least one biochemically characterized sequence. Characterized sequences were used to name the clusters (or “subgroups”); i.e., the NfsA subgroup includes E. coli NfsA67 and close homologues (cyan nodes in Figure 1a), and the Iyd subgroup is exemplified by the Iyd enzyme that catalyzes the dehalogenation of iodinated tyrosine68 (magenta nodes in Figure 1a). Although SSN-based clustering and annotation are effective ways to identify explored (and thus unexplored) sequence space within a protein superfamily, the SSN clusters identified through this analysis are not necessarily monofunctional. It is likely that there are still diverse functions within “known” clusters (see the Deducing Functional Diversity within Subgroups of the NTR Superfamily and Complexities of Inferring Function from Sequence Similarity Relationships). Delineation of monofunctional clusters requires very careful, case-by-case examination with supporting experimental evidence and consequently cannot be easily streamlined.
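Separating clusters that contain at least one characterized sequence from those that contain none, as described above, amounts to a simple lookup once cluster membership is known. The following is a minimal sketch, assuming cluster membership and a curated map of characterized sequences are already in hand; `classify_clusters` is an illustrative name, the cluster contents in the usage note are toy data, and only the two UniProtKB identifiers shown there (P17117 for NfsA, Q92PC8 for BluB) come from Table 1.

```python
def classify_clusters(clusters, characterized):
    """Label each cluster as 'known' or 'unknown'.

    clusters: dict mapping a cluster name to a set of sequence IDs.
    characterized: dict mapping a sequence ID to its verified function.
    A cluster is 'known' if it holds at least one characterized sequence.
    """
    summary = {}
    for name, members in clusters.items():
        functions = sorted({characterized[m] for m in members if m in characterized})
        summary[name] = {
            "size": len(members),
            "status": "known" if functions else "unknown",
            "functions": functions,
        }
    return summary
```

For example, `classify_clusters({"c1": {"P17117", "x1"}, "c2": {"x2", "x3"}}, {"P17117": "nitroreduction", "Q92PC8": "flavin fragmentation"})` marks `c1` as known and `c2` as unknown; as noted above, a "known" label does not imply that the cluster is monofunctional.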

superfamily is a critically important step in SSN analyses. Thus, for initial analysis of a superfamily, it is recommended that users generate networks using several arbitrary thresholds across a wide range to identify the most suitable threshold for their research aims. The optimal E-value threshold can substantially differ depending on the data set and the purpose of the analysis. For instance, a stringent threshold less than 1 × 10−100 may be necessary to delineate highly similar enzyme sequences that share the same chemical reaction but differ in substrate specificity. An initial threshold “scan” can be performed, generating multiple networks across a broad range of highly permissive (e.g., 1 × 10−10) to highly stringent (e.g., 1 × 10−120) E-values. Subsequently, on the basis of initial analyses of these SSNs, the user can then more systematically sample different thresholds on a finer scale. For example, the threshold chosen for Figure 1a (E-value 1 × 10−22) was the result of sampling a wide range of E-value scores, from 1 × 10−10 to 1 × 10−100, while looking for consistency between the sequence-similarity-based clustering and functional information. As shown in Figure 1c, at a threshold of 1 × 10−40, sequence clusters are more discernible; however, intercluster relationships are lost. Conversely, the network is highly connected and uninformative at thresholds lower than 1 × 10−15. “Thresholding”, i.e., the process of iteratively analyzing thresholds of increasing stringency, in combination with additional information about the associated proteins (functional assignment, phylogeny, domain architecture, etc.; see The Challenge of Analyzing Unknown Sequence Clusters and Supporting Information) facilitates the systematic analysis of sequence and functional relationships. Identification of Known and Unknown Sequence Clusters within the Superfamily. One of the advantages of SSN analysis lies with the ability to overlay and visualize multiple types of information. 
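The threshold “scan” described above can be sketched as a loop that counts clusters at each E-value cutoff. This is a minimal illustration, assuming pairwise E-values have already been collected into a dictionary keyed by sequence-ID pairs; the function names, the union-find implementation, and the default exponent range (10 to 120 in steps of 10, i.e., 1 × 10−10 to 1 × 10−120) are choices made for this sketch rather than part of the published workflow.

```python
def count_clusters(nodes, pair_evalues, threshold):
    """Count connected components when edges require evalue <= threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), evalue in pair_evalues.items():
        if evalue <= threshold:
            parent[find(a)] = find(b)
    return len({find(n) for n in nodes})


def threshold_scan(nodes, pair_evalues, exponents=range(10, 121, 10)):
    """Map each exponent k to the number of clusters at threshold 10**-k."""
    return {k: count_clusters(nodes, pair_evalues, 10.0 ** -k) for k in exponents}
```

A permissive threshold typically yields one highly connected component, and the cluster count rises as the threshold becomes more stringent; the range over which the count and the functional annotations agree is a reasonable place to begin finer sampling.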
In Figure 1a, all biochemically characterized nodes are enlarged and colored, enabling clusters that contain characterized enzymes to be visualized. Concurrently, clusters with no characterized sequences, i.e., representing unexplored sequence space, are simultaneously identified. By integrating all available functional knowledge, even though such information is extremely sparse in the NTR superfamily, we found that it tracks broadly with the sequence clustering visualized at 1 × 10−22. For example, experimentally characterized BluB enzymes,48 and sequences showing significant similarity to BluB sequences, cluster in a defined single



THE CHALLENGE OF ANALYZING UNKNOWN SEQUENCE CLUSTERS

Previously Unexplored Sequence Space in the NTR Superfamily. Eight subgroups of the NTR superfamily have no known biological roles or documented activities. Thus, they represent unexplored sequence space. These subgroups were designated “unk” for unknown (Table 3) and are labeled by subgroup size (which can be calculated via the output files from CD-HIT, Supporting Information). A comprehensive table summarizing all (known and unknown) subgroups of the NTR superfamily was reported previously.29 In addition to their distinct sequence similarity relationships within the superfamily, i.e., distinct clustering, each unknown subgroup displays unique characteristics, e.g., sequence length, taxonomic profiles, etc., indicating that they may possess distinct


Figure 2. Representative structures from the NTR superfamily. Eight PDB structures from the NTR superfamily are shown in ribbon representation. The bound FMN is displayed in the stick model with carbons colored in yellow. (Only one active site, of the two encoded, is shown for simplicity.) Structural extensions29 are colored in green (extension 1), red (extension 2), and blue (extension 3) and are labeled in 3e39.

unk2, unk3, and unk6, structural information is available (e.g., PDB IDs 3k6h, 3pxv, 2i7h, and 2r01, respectively) (Figure 1c and Figure 3). Structures from the Hub subgroup closely resemble the minimal FMN-binding scaffold that is a universal constituent of members of the NTR superfamily.29 As seen in PDB ID 3e39, Hub subgroup structures typically display short (or no) insertions at the three superfamily “hotspots” (the structural extension sites E1, E2, and E3, respectively) (Figure 2). A structural insertion is observed at the E3 hotspot in PDB ID 2r01 (unk6). Other subgroups that include members with characterized functions also display insertions at E3 (e.g., PDB ID 1f5v); however, the unk6 E3 insertion displays a novel β-strand architecture. Similarly, deviations are seen in the E1 extensions of unk1, unk2, and unk3. Structural modifications, especially when conserved throughout a sequence cluster and distinct from other superfamily members, are rich grounds for further investigations. Furthermore, PyMOL plugins such as ProMol,77 when integrated with BLAST, Pfam, and DALI,78 can aid functional predictions.79

Deducing Functional Diversity within Subgroups of the NTR Superfamily. The primary level of sequence clustering (subgrouping) can emphasize large areas of unexplored sequence space within the superfamily. However, multiple functions may have diverged within a sequence cluster or subgroup (discussed in detail in Complexities of Inferring Function from Sequence Similarity Relationships). Analysis of secondary sequence clusters or subsubgroups (SSGs) can therefore be useful as another source of unexplored sequence space. For example, within “known” subgroups of the superfamily, there may also be functions in addition to the one primarily identified, especially for the large diverse subgroups that contain >1500 members.
The generation of MSAs for individual clusters or subgroups, as discussed above, can facilitate the quantification of intrasubgroup diversity and is thus an informative and recommended extension of SSN-based analyses (Supporting Information). Subgroups such as BluB and Iyd show high sequence conservation (42 and 45% average sequence identity, respectively), including the conservation of key experimentally verified functional residues,80,81 suggesting that members of these subgroups are likely to catalyze the same enzymatic reaction. By contrast, the SagB subgroup, which is large and diverse (>1900 sequences with 32% average sequence identity), has very few characterized members. Furthermore,

functions from the known repertoire in the superfamily (Table 3). For example, members of unknown subgroup 1 (“unk1”) are predominantly found in Proteobacteria and members of unk9 are predominantly found in Firmicutes. Members of unk2 and unk5, in comparison, show a much wider phylogenetic range and can be found in Archaea, Eukaryota, and Bacteria (Table 3). Insight can also be gained when observing variations in sequence length that can indicate insertion or deletion of structural elements and/or fusion of additional domains. For example, members of unk1, unk2, unk3, unk6, unk9, and unk11 display short sequences that resemble the minimal FMN-binding scaffold of the NTR superfamily (Figure 2b; Table 3).29 In contrast, unk4, unk5, unk7, unk8, and unk10 all display longer sequences, potentially indicating the insertion of structural loops or fusion of additional domains. Large but sporadic variation in the sequence length (when observed at a subgroup level) may also indicate fusion proteins that can provide inference of potential biosynthesis pathways. Support for this hypothesis is observed in sequence clusters that have experimentally verified functions and are associated with known biosynthetic pathways. For example, characterized enzymes from the BluB subgroup fragment flavin mononucleotide (FMN) for cobalamin biosynthesis.48 Profiling the sequence lengths and Pfam domain associations of BluB subgroup members revealed that 15% (125/859) show an increase in sequence length of >100 aa over the typical length of the NTR domain (PF00881). This increase in sequence length is positively associated with the fusion of other enzymes from the cobalamin biosynthetic pathway such as CbiA (PF01656), CobB (PF07685), CobT (PF02277), and CobU (PF02283).69 If available, it can be advantageous to analyze the position of sequence insertions and deletions in 3D structure, especially if they are located near active site residues.
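The length profiling described above can be approximated with a short script that flags subgroup members exceeding a typical length by more than a set margin. This is a sketch under stated assumptions: the text does not specify how the “typical length” was determined, so the median of the subgroup is used here as a stand-in, and `flag_extended` is an illustrative name.

```python
from statistics import median


def flag_extended(lengths, typical_length=None, extra=100):
    """Flag sequences longer than the typical length by more than `extra` aa.

    lengths: dict mapping sequence ID to sequence length (aa).
    typical_length: reference length; defaults to the subgroup median
        (an assumption made for this sketch).
    Returns (sorted list of flagged IDs, flagged fraction of the subgroup).
    """
    if not lengths:
        return [], 0.0
    if typical_length is None:
        typical_length = median(lengths.values())
    flagged = sorted(sid for sid, n in lengths.items() if n - typical_length > extra)
    return flagged, len(flagged) / len(lengths)
```

For the BluB subgroup, the reported 125 of 859 members corresponds to a fraction of 125/859 ≈ 0.146, i.e., the 15% quoted above. Flagged sequences are natural candidates for a follow-up domain search to test for fused partners.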
MSAs, facilitated by programs such as MUSCLE, Clustal Omega, or 3D-Coffee,70−72 can be generated for cluster representatives (or extended data sets, using the cluster information from the CD-HIT output, Supporting Information) to evaluate the conservation of active site residues. PDB files can be extracted from the RCSB PDB,73 and structural information can be analyzed in detail via programs such as PyMOL or Chimera.74,75 If crystal or NMR structure information is unavailable, model structures can be generated from primary sequences (e.g., via SWISS-MODEL76). For example, although a function is yet to be established for unk1,


Figure 3. Analysis of the NfsA subgroup of the NTR superfamily. A sequence similarity network of the NfsA subgroup visualized using Cytoscape at an edge inclusion threshold of 1 × 10−56. Select subsubgroups (SSGs) are labeled, nodes including functionally verified proteins are enlarged and colored in cyan, and proteins experimentally characterized in this work are enlarged and displayed as triangles. (inset) The relative NADPH/NADH activity of representative SSG members that encode conserved and divergent residues at the extension 3 (E3) Arg203 position (PDB ID 1f5v numbering). Enzymes were analyzed via an adapted NAD(P)H depletion method83 using 4-nitrobenzamide as the terminal electron acceptor. UniProtKB identifiers for the enzymes investigated are SSG3, Q88K03; SSG6, R7D286, E6K352; SSG7, P94424; SSG10, E0FFF4; SSG12, P39605; SSG13, Q8YA64; SSG14, F2I842.

deviate from the archetype NADPH metabolism of the NfsA subgroup and instead preferentially utilize NADH.
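Checking whether a residue such as Arg203 is conserved across a subgroup reduces to locating the corresponding alignment column and tallying the residues found there. The following is a minimal sketch, assuming the MSA has already been loaded as a dictionary of equal-length aligned strings with "-" as the gap character; the function names and the toy alignment in the usage note are hypothetical and do not reproduce the alignments analyzed here.

```python
from collections import Counter


def alignment_column(reference_aligned, residue_number):
    """Map a 1-based residue number in the ungapped reference sequence
    to a 0-based column index in the alignment."""
    count = 0
    for column, char in enumerate(reference_aligned):
        if char != "-":
            count += 1
            if count == residue_number:
                return column
    raise ValueError("residue number exceeds reference sequence length")


def residue_profile(alignment, reference_id, residue_number):
    """Tally the residues (or gaps) observed at the alignment column that
    corresponds to `residue_number` in the reference sequence."""
    column = alignment_column(alignment[reference_id], residue_number)
    return Counter(seq[column] for seq in alignment.values())
```

With a toy alignment such as `{"ref": "M-KR-L", "s1": "MAKRGL", "s2": "M-KQ-L"}`, calling `residue_profile(alignment, "ref", 3)` tallies column 4 of the alignment, where `s2` carries Q in place of the reference R. Sequences that deviate at a functionally important position are candidates for experimental follow-up.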

when this subgroup is observed in higher resolution SSNs, i.e., at more stringent edge inclusion thresholds, multiple distinct sequence clusters can be observed (Figure 1a,c); these SSGs may individually possess different catalytic and substrate specificities. Similarly, less than 40% average sequence identity is observed within each of the NfsA and NfsB/MhqN subgroups.29 For example, 28 enzymes have been experimentally investigated from the NfsA subgroup. However, examination of NfsA subgroup sequences using a higher percent identity cutoff (90% ID) and a more stringent threshold (1 × 10−56) revealed that these 28 enzymes are from a narrow subset of the NfsA subgroup (Figure 3). Inspection of subgroup-level networks is thus informative to establish functional diversity of superfamily subgroups. This approach can be complemented by MSAs, which are typically more robust at a subgroup or SSG level than a superfamily level (due to higher sequence identities). MSA analysis can inform key active site architecture and highlight residues essential for catalytic activity, and an investigation of residue conservation patterns can be a useful approach to deduce divergent function. For example, arginine (Arg203, E. coli P17117 numbering) is important for NADPH coordination in characterized members of the NfsA subgroup82 and preferential metabolism of the NADPH cofactor is thought to be a key conserved characteristic of NfsA subgroup members. However, MSAs of the NfsA subgroup highlighted the presence of a sequence cluster that deviates from the typically conserved Arg203 (SSG6; Figure 3). To investigate, we experimentally characterized a selection of sequences from a range of NfsA SSGs to profile NAD(P)H preference. Indeed, SSG6 proteins



ADVANCED GENERATION AND ANALYSIS OF SSNS USING EXPANDED SEQUENCE SETS

Databases such as InterPro, Pfam, and CATH-Gene3D84 are sequence and structure classification systems that provide an excellent resource for associating protein sequences with a superfamily or family of interest. However, each database utilizes different methods with distinctive perspectives, and thus, their data sets can contain subtly different sets of sequences. The vast majority of sequence and domain classifications are not defined by functional features; however, function is paramount for the biological questions of interest. Harnessing data from multiple databases should therefore result in a comprehensive sequence data set for a protein superfamily. In order to collate an all-inclusive set of sequences to represent the NTR superfamily, we collected an exhaustive nonredundant set of all available sequences and structures that can be associated with the NTR superfamily from multiple databases.29 The resulting sequence set was manually filtered based on literature and experimental knowledge and contained 24 270 sequences that range between 150 and 1580 amino acids in length. This data set, curated in 2015, contains significantly fewer sequences than the number now associated with the PF00881 domain (>84 000 sequences in the UniProtKB release 2018_05), which can be attributed to redundant (identical) sequences in the database and the recent expansion of sequence space through modern sequencing technologies. To generate an SSN that retained all sequence information within each representative node, the Pythoscape


Figure 4. Sequence similarity networks of the NTR superfamily. (A) A representative sequence similarity network of the NTR superfamily (generated with Pythoscape) at an E-value threshold of 1 × 10−18. Nodes represent sequence sets that contain 1−307 sequences grouped by 60% identity. Subgroups, as defined previously,29 are colored. (B) A sequence similarity network of the PF00881 50% identity representatives (as per Figure 1) is displayed at a threshold of 1 × 10−22. Subgroups are colored as per part A.



software85 was utilized. Note that Pythoscape typically requires tailoring for the user’s specific hardware environment. A 60% ID cutoff was used to condense the data set, which resulted in 5604 representative sets of sequences that each contained 1−307 individual members. Average pairwise BLAST E-values were calculated from the individual pairwise BLAST of each member of each representative sequence set. A representative SSN created using the Pythoscape platform is shown in Figure 4a colored by subgrouping classification. For comparative purposes, the CD-HIT-generated SSN in the above text is shown in Figure 4b. Overall, these networks emphasize the robust nature of SSN-based investigations; although created using different approaches (e.g., CD-HIT and SFLD/Pythoscape) and distinct data sets, they consistently provide similar topologies and sequence similarity relationships. Sequences in the CD-HIT data set (Figure 4b) that are not represented in Figure 4a are displayed as uncolored (gray) nodes and highlight the expansion of the NTR superfamily data set from that used in earlier investigations;29 SSN clustering is maintained, but the coverage of NTR superfamily sequence space is expanding. Advanced computational approaches can also be used to identify subgroup, or sequence cluster, specific variation, i.e., using active site profiling or substrate docking approaches.58,86 These variations can be used as graphical features for SSN nodes facilitating global evaluation (i.e., across the entire superfamily) and visualization of their distribution across distinct subgroups. 
Additionally, researchers can guide hypotheses of substrate and catalytic specificities through complementary approaches, such as genomic context analysis, i.e., investigations of the neighboring genes that form an operon with the gene of interest and the conservation of these genomic relationships among other members of the sequence cluster, or alternatively, profiling abundance in microbiome or metagenome surveys.87−89
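The averaging of pairwise BLAST E-values between representative sequence sets, followed by thresholding to draw edges, can be illustrated with a minimal Python sketch. This is not the Pythoscape implementation. The function names, the toy data, the choice to average −log10(E) rather than raw E-values, and the choice to skip member pairs without a BLAST hit are all assumptions made for illustration only.

```python
import math
from itertools import product


def mean_neglog_evalue(members_a, members_b, evalues):
    """Mean -log10(E) over member pairs that have a recorded BLAST hit.

    Returns None when no pair between the two sets has a hit.
    Averaging on the log scale is an assumption, not a documented
    Pythoscape behavior.
    """
    scores = []
    for a, b in product(members_a, members_b):
        e = evalues.get(frozenset((a, b)))
        if e is not None:
            # Clamp E = 0 to a tiny positive value to avoid log10(0).
            scores.append(-math.log10(max(e, 1e-300)))
    return sum(scores) / len(scores) if scores else None


def build_ssn_edges(rep_sets, evalues, threshold=1e-18):
    """Keep an edge between two representative nodes when their mean
    -log10(E) is at least as significant as the E-value threshold."""
    cutoff = -math.log10(threshold)
    nodes = sorted(rep_sets)
    edges = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            score = mean_neglog_evalue(rep_sets[u], rep_sets[v], evalues)
            if score is not None and score >= cutoff:
                edges[(u, v)] = score
    return edges


# Hypothetical toy data: three representative nodes and a few BLAST hits.
rep_sets = {"node1": ["seqA", "seqB"], "node2": ["seqC"], "node3": ["seqD"]}
evalues = {
    frozenset(("seqA", "seqC")): 1e-30,
    frozenset(("seqB", "seqC")): 1e-20,
    frozenset(("seqA", "seqD")): 1e-5,
}
edges = build_ssn_edges(rep_sets, evalues, threshold=1e-18)
print(edges)
```

In this toy case only node1 and node2 remain connected at the 1 × 10^−18 threshold; a more stringent threshold removes that edge as well, which mirrors how the choice of threshold changes the apparent clustering in an SSN.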

COMPLEXITIES OF INFERRING FUNCTION FROM SEQUENCE SIMILARITY RELATIONSHIPS

Within a protein superfamily, diverse functions likely emerged from a common ancestor.9,90 Divergence, under a given selection pressure to retain a function, can create clusters of sequences that possess the same, or very similar, function (functional families). These sequences are likely to display greater sequence similarity to each other than to sequences from alternative functional families. In some cases, sequence identity between distinct functional families can be less than 20%, as they may have diverged millions of years ago, and/or their sequences have rapidly evolved due to strong selection forces. Intuitively, sequences that form robust clusters in an SSN may therefore share the same or similar functions. However, investigations of laboratory and natural protein evolution have demonstrated that a new protein function can emerge via a handful of mutational steps or even a single mutation.91−97 Thus, proteins can exhibit >90% sequence identity but encode distinct functions, as a new function has emerged and evolved only recently.98,99 Sequence signatures that differentiate distinct functions are thus ambiguous and highly complex; therefore, a single computational approach, including the one we describe here, is unlikely to separate all functional families. Ultimately, comprehensive analyses of a large sample of available sequence, structure, and function data, which can be facilitated by sequence analyses such as SSNs at both the global scale (superfamily) and local scale (family level), as well as detailed MSAs and phylogenetic analyses, are necessary to direct and inform hypotheses of functional divergence.



CONCLUDING REMARKS

One of the ultimate goals in biochemistry is to reveal, in exhaustive terms, the relationships between sequence, structure, and function, thus elucidating the complete functional capacity of a protein fold. To address this aim with realistic experimental efforts (especially in consideration of the continuous and rapid expansion of sequence data), comprehensive bioinformatic analyses provide an extremely powerful tool to classify sequence diversity and identify unexplored sequence space. A vast amount of untapped potential for novel discoveries of protein and enzyme functions remains hidden in many large superfamilies. For example, BluB enzymes catalyze the unprecedented fragmentation of FMN for vitamin B12 biosynthesis,48 while other enzymes in the NTR superfamily typically utilize FMN as a cofactor for redox reactions.29 Prior to their discovery in 2007, BluB sequences were annotated as nitroreductase enzymes due to their similarity to the nitroreductase fold. As visualized in Figure 1, BluB enzymes form a distinct cluster of highly similar sequences within the NTR superfamily. The methods outlined here can help identify novel sequence space with the underlying hypothesis that novel functions may be encoded in these previously unexplored sequences. However, as discussed above, we would like to emphasize that the evolutionary divergence of functions in a superfamily is an extremely complicated process and a single bioinformatics approach cannot separate sequences into monofunctional clusters. It is essential to combine computational approaches with comprehensive experimental exploration. Recent advances in synthesis technology, e.g., DropSynth,100 are dramatically decreasing the cost of gene synthesis. Therefore, alongside the establishment of high-throughput methods for activity screening, large scale experimental undertakings (>1000 genes) are now a financially and technically viable option for more laboratories. Prospecting for new functions is therefore becoming a more reachable objective.

Systematic and high-throughput approaches can be particularly informative if the selection of target functions that are investigated (i.e., the substrates and ligands used in activity profiling) covers a diverse range of the potential functional repertoire of the superfamily.20,22,101−103 New sequence space, however, may exhibit unexpected functions that are distinct from the known ones within the same superfamily. In such cases, systematic computational characterizations, including genomic context analysis as well as in silico molecular docking, may guide the prediction of potential substrates, reactions, and ligands for new enzymes and proteins.15,55,88,104,105 However, researchers must also remain cautious about the influence of experimental conditions. Protein function is often context-dependent, and there is a vast range of biological conditions that can be found in nature, e.g., temperature, pH, salinity, and/or the availability of cofactors and interacting protein domains or modifying enzymes.106 It may be important to explore multiple experimental conditions to gain a more comprehensive understanding of the functional diversity within a superfamily.107 Systematic high-throughput phenotyping via “knock-out” effects on the host organism or “knock-in” effects on model organisms can be a useful approach to confirm the corresponding physiological substrates and pathways of newly discovered functions, e.g., leveraging technologies such as phenotypic microarrays108 or metabolomic approaches.109,110 Enhanced protein modeling approaches, which integrate metagenomic data and evolutionary relationships, also offer structural insights for novel sequence space that have previously been unobtainable.111 As higher throughput experimental approaches become more established and accessible, concurrent advances in machine learning112 will also become increasingly important for the prediction, analyses, and extrapolation of the relationships between protein function, phenotype, and the underlying molecular details. Integration of the computational approaches we outline in this work with advanced experimental techniques has the potential to greatly expand our understanding of the functional repertoire in the biosphere and our ability to design and modulate protein function for biocatalysis, and ultimately to improve our understanding of the diversity of organisms and cellular processes in nature.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.biochem.8b00473.

Pipeline for the generation of in-house SSNs (PDF)

Script to map IDs to clusters to identifiers (TXT)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +1 (604) 822-8156.

ORCID

Nobuhiko Tokuriki: 0000-0002-8235-1829

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

N.T. acknowledges the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (RGPIN 2017-04909). N.T. is a CIHR new investigator and a Michael Smith Foundation for Health Research (MSFHR) career investigator. P.C.B. acknowledges the National Institutes of Health (NIH) (R01 GM60595).



REFERENCES

(1) Jensen, R. A. (1976) Enzyme Recruitment in Evolution of New Function. Annu. Rev. Microbiol. 30, 409−425. (2) O’Brien, P. J., and Herschlag, D. (1999) Catalytic promiscuity and the evolution of new enzymatic activities. Chem. Biol. 6, R91−R105. (3) Copley, S. D. (2003) Enzymes with extra talents: moonlighting functions and catalytic promiscuity. Curr. Opin. Chem. Biol. 7, 265−272. (4) Aharoni, A., Gaidukov, L., Khersonsky, O., Gould, S., Roodveldt, C., and Tawfik, D. S. (2005) The “evolvability” of promiscuous protein functions. Nat. Genet. 37, 73−76. (5) Tokuriki, N., and Tawfik, D. S. (2009) Protein dynamism and evolvability. Science 324, 203−207. (6) Brown, S. D., and Babbitt, P. C. (2012) Inference of functional properties from large-scale analysis of enzyme superfamilies. J. Biol. Chem. 287, 35−42. (7) Brown, S. D., and Babbitt, P. C. (2014) New Insights about Enzyme Evolution from Large Scale Studies of Sequence and Structure Relationships. J. Biol. Chem. 289, 30221−30228. (8) Finn, R. D., Attwood, T. K., Babbitt, P. C., Bateman, A., Bork, P., Bridge, A. J., Chang, H. Y., Dosztányi, Z., El-Gebali, S., Fraser, M., Gough, J., Haft, D., Holliday, G. L., Huang, H., Huang, X., Letunic, I., Lopez, R., Lu, S., Marchler-Bauer, A., Mi, H., Mistry, J., Natale, D. A., Necci, M., Nuka, G., Orengo, C. A., Park, Y., Pesseat, S., Piovesan, D., Potter, S. C., Rawlings, N. D., Redaschi, N., Richardson, L., Rivoire, C., Sangrador-Vegas, A., Sigrist, C., Sillitoe, I., Smithers, B., Squizzato, S., Sutton, G., Thanki, N., Thomas, P. D., Tosatto, S. C., Wu, C. H., Xenarios, I., Yeh, L. S., Young, S. Y., and Mitchell, A. L. (2017) InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190−D199. (9) Gerlt, J. A., and Babbitt, P. C. (2001) Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem. 70, 209−246. (10) Almonacid, D. E., and Babbitt, P. C. 
(2011) Toward mechanistic classification of enzyme functions. Curr. Opin. Chem. Biol. 15, 435−442.


(11) Vass, M., Kooistra, A. J., Yang, D., Stevens, R. C., Wang, M. W., and de Graaf, C. (2018) Chemical Diversity in the G Protein-Coupled Receptor Superfamily. Trends Pharmacol. Sci. 39, 494−512. (12) CAZypedia Consortium (2018) Ten years of CAZypedia: a living encyclopedia of carbohydrate-active enzymes. Glycobiology 28, 3−8. (13) Wallrapp, F. H., Pan, J. J., Ramamoorthy, G., Almonacid, D. E., Hillerich, B. S., Seidel, R., Patskovsky, Y., Babbitt, P. C., Almo, S. C., Jacobson, M. P., and Poulter, C. D. (2013) Prediction of function for the polyprenyl transferase subgroup in the isoprenoid synthase superfamily. Proc. Natl. Acad. Sci. U. S. A. 110, E1196−202. (14) Swinehart, W. E., and Jackman, J. E. (2015) Diversity in mechanism and function of tRNA methyltransferases. RNA Biol. 12, 398−411. (15) Huang, H., Carter, M. S., Vetting, M. W., Al-Obaidi, N., Patskovsky, Y., Almo, S. C., and Gerlt, J. A. (2015) A General Strategy for the Discovery of Metabolic Pathways: d-Threitol, l-Threitol, and Erythritol Utilization in Mycobacterium smegmatis. J. Am. Chem. Soc. 137, 14570−14573. (16) London, N., Farelli, J. D., Brown, S. D., Liu, C., Huang, H., Korczynska, M., Al-Obaidi, N. F., Babbitt, P. C., Almo, S. C., Allen, K. N., and Shoichet, B. K. (2015) Covalent Docking Predicts Substrates for Haloalkanoate Dehalogenase Superfamily Phosphatases. Biochemistry 54, 528−37. (17) Perez-Rueda, E., Hernandez-Guerrero, R., Martinez-Nuñez, M. A., Armenta-Medina, D., Sanchez, I., and Ibarra, J. A. (2018) Abundance, diversity and domain architecture variability in prokaryotic DNA-binding transcription factors. PLoS One 13, e0195332. (18) Atkinson, J. T., Campbell, I., Bennett, G. N., and Silberg, J. J. (2016) Cellular Assays for Ferredoxins: A Strategy for Understanding Electron Flow through Protein Carriers That Link Metabolic Pathways. Biochemistry 55, 7047−7064. (19) Rao, G., and Oldfield, E. (2016) Structure and Function of Four Classes of the 4Fe-4S Protein, IspH. 
Biochemistry 55, 4119−4129. (20) Baier, F., and Tokuriki, N. (2014) Connectivity between catalytic landscapes of the metallo-β-lactamase superfamily. J. Mol. Biol. 426, 2442−2456. (21) Colin, P. Y., Kintses, B., Gielen, F., Miton, C. M., Fischer, G., Mohamed, M. F., Hyvönen, M., Morgavi, D. P., Janssen, D. B., and Hollfelder, F. (2015) Ultrahigh-throughput discovery of promiscuous enzymes by picodroplet functional metagenomics. Nat. Commun. 6, 10008. (22) Mashiyama, S. T., Malabanan, M. M., Akiva, E., Bhosle, R., Branch, M. C., Hillerich, B., Jagessar, K., Kim, J., Patskovsky, Y., Seidel, R. D., Stead, M., Toro, R., Vetting, M. W., Almo, S. C., Armstrong, R. N., and Babbitt, P. C. (2014) Large-scale determination of sequence, structure, and function relationships in cytosolic glutathione transferases across the biosphere. PLoS Biol. 12, e1001843−e1001843. (23) Lukk, T., Sakai, A., Kalyanaraman, C., Brown, S. D., Imker, H. J., Song, L., Fedorov, A. A., Fedorov, E. V., Toro, R., Hillerich, B., Seidel, R., Patskovsky, Y., Vetting, M. W., Nair, S. K., Babbitt, P. C., Almo, S. C., Gerlt, J. A., and Jacobson, M. P. (2012) Homology models guide discovery of diverse enzyme specificities among dipeptide epimerases in the enolase superfamily. Proc. Natl. Acad. Sci. U. S. A. 109, 4122−4127. (24) Schnoes, A. M., Brown, S. D., Dodevski, I., and Babbitt, P. C. (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. 5, e1000605. (25) Schnoes, A. M., Ream, D. C., Thorman, A. W., Babbitt, P. C., and Friedberg, I. (2013) Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput. Biol. 9, e1003063. (26) Glasner, M. E., Gerlt, J. A., and Babbitt, P. C. (2006) Mechanisms of protein evolution and their application to protein engineering. Adv. Enzymol. Relat. Areas Mol. Biol. 75, 193−239. (27) Gerlt, J. A., and Babbitt, P. C. 
(2009) Enzyme (re)design: lessons from natural evolution and computation. Curr. Opin. Chem. Biol. 13, 10−18.

(28) Renata, H., Wang, Z. J., and Arnold, F. H. (2015) Expanding the enzyme universe: accessing non-natural reactions by mechanism-guided directed evolution. Angew. Chem., Int. Ed. 54, 3351−3367. (29) Akiva, E., Copp, J. N., Tokuriki, N., and Babbitt, P. C. (2017) Evolutionary and molecular foundations of multiple contemporary functions of the nitroreductase superfamily. Proc. Natl. Acad. Sci. U. S. A. 114, E9549−E9558. (30) Selvadurai, K., Wang, P., Seimetz, J., and Huang, R. H. (2014) Archaeal Elp3 catalyzes tRNA wobble uridine modification at C5 via a radical mechanism. Nat. Chem. Biol. 10, 810−812. (31) Wichelecki, D. J., Graff, D. C., Al-Obaidi, N., Almo, S. C., and Gerlt, J. A. (2014) Identification of the in vivo function of the high-efficiency D-mannonate dehydratase in Caulobacter crescentus NA1000 from the enolase superfamily. Biochemistry 53, 4087−4089. (32) Zhang, X., Kumar, R., Vetting, M. W., Zhao, S., Jacobson, M. P., Almo, S. C., and Gerlt, J. A. (2015) A unique cis-3-hydroxy-l-proline dehydratase in the enolase superfamily. J. Am. Chem. Soc. 137, 1388−1391. (33) Goble, A. M., Fan, H., Sali, A., and Raushel, F. M. (2011) Discovery of a cytokinin deaminase. ACS Chem. Biol. 6, 1036−1040. (34) Gerlt, J. A., Babbitt, P. C., Jacobson, M. P., and Almo, S. C. (2012) Divergent evolution in enolase superfamily: strategies for assigning functions. J. Biol. Chem. 287, 29−34. (35) Goble, A. M., Feng, Y., Raushel, F. M., and Cronan, J. E. (2013) Discovery of a cAMP deaminase that quenches cyclic AMP-dependent regulation. ACS Chem. Biol. 8, 2622−2629. (36) Pitsawong, W., Hoben, J. P., and Miller, A. F. (2014) Understanding the Broad Substrate Repertoire of Nitroreductase Based on Its Kinetic Mechanism. J. Biol. Chem. 289, 15203−15214. (37) Prosser, G. A., Copp, J. N., Mowday, A. M., Guise, C. P., Syddall, S. P., Williams, E. M., Horvat, C. N., Swe, P. M., Ashoorzadeh, A., Denny, W. A., Smaill, J. B., Patterson, A. V., and Ackerley, D. F. 
(2013) Creation and screening of a multi-family bacterial oxidoreductase library to discover novel nitroreductases that efficiently activate the bioreductive prodrugs CB1954 and PR-104A. Biochem. Pharmacol. 85, 1091−1103. (38) Copp, J. N., Mowday, A. M., Williams, E. M., Guise, C. P., Ashoorzadeh, A., Sharrock, A. V., Flanagan, J. U., Smaill, J. B., Patterson, A. V., and Ackerley, D. F. (2017) Engineering a Multifunctional Nitroreductase for Improved Activation of Prodrugs and PET Probes for Cancer Gene Therapy. Cell Chem. Biol. 24, 391−403. (39) Curado, S., Anderson, R. M., Jungblut, B., Mumm, J., Schroeter, E., and Stainier, D. Y. (2007) Conditional targeted cell ablation in zebrafish: a new tool for regeneration studies. Dev. Dyn. 236, 1025− 1035. (40) Zhang, L., Routsong, R., Nguyen, Q., Rylott, E. L., Bruce, N. C., and Strand, S. E. (2017) Expression in grasses of multiple transgenes for degradation of munitions compounds on live-fire training ranges. Plant Biotechnol J. 15, 624−633. (41) Zenno, S., Koike, H., Tanokura, M., and Saigo, K. (1996) Gene cloning, purification, and characterization of NfsB, a minor oxygeninsensitive nitroreductase from Escherichia coli, similar in biochemical properties to FRase I, the major flavin reductase in Vibrio fischeri. J. Biochem. 120, 736−744. (42) Takeda, K., Iizuka, M., Watanabe, T., Nakagawa, J., Kawasaki, S., and Niimura, Y. (2007) Synechocystis DrgA protein functioning as nitroreductase and ferric reductase is capable of catalyzing the Fenton reaction. FEBS J. 274, 1318−1327. (43) Hou, F., Miyakawa, T., Kitamura, N., Takeuchi, M., Park, S. B., Kishino, S., Ogawa, J., and Tanokura, M. (2015) Structure and reaction mechanism of a novel enone reductase. FEBS J. 282, 1526−1537. (44) Lee, S. W., Mitchell, D. A., Markley, A. L., Hensler, M. E., Gonzalez, D., Wohlrab, A., Dorrestein, P. C., Nizet, V., and Dixon, J. E. (2008) Discovery of a widely distributed toxin biosynthetic gene cluster. Proc. Natl. Acad. Sci. U. S. A. 
105, 5879−5884. (45) Mermod, M., Mourlane, F., Waltersperger, S., Oberholzer, A. E., Baumann, U., and Solioz, M. (2010) Structure and function of CinD (YtjD) of Lactococcus lactis, a copper-induced nitroreductase involved in defense against oxidative stress. J. Bacteriol. 192, 4172−4180.


(46) Bang, S. Y., Kim, J. H., Lee, P. Y., Bae, K. H., Lee, J. S., Kim, P. S., Lee, D. H., Myung, P. K., Park, B. C., and Park, S. G. (2012) Confirmation of Frm2 as a novel nitroreductase in Saccharomyces cerevisiae. Biochem. Biophys. Res. Commun. 423, 638−641. (47) Phatarphekar, A., and Rokita, S. E. (2016) Functional analysis of iodotyrosine deiodinase from Drosophila melanogaster. Protein Sci. 25, 2187−2195. (48) Taga, M. E., Larsen, N. A., Howard-Jones, A. R., Walsh, C. T., and Walker, G. C. (2007) BluB cannibalizes flavin to form the lower ligand of vitamin B12. Nature 446, 449−453. (49) Gondry, M., Lautru, S., Fusai, G., Meunier, G., Ménez, A., and Genet, R. (2001) Cyclic dipeptide oxidase from Streptomyces noursei Isolation, purification and partial characterization of a novel, amino acyl alpha,beta-dehydrogenase. Eur. J. Biochem. 268, 1712−1721. (50) Bashiri, G., Rehan, A. M., Sreebhavan, S., Baker, H. M., Baker, E. N., and Squire, C. J. (2016) Elongation of the poly-γ-glutamate tail of F420 requires both domains of the F420:γ-glutamyl ligase (FbiB) of Mycobacterium tuberculosis. J. Biol. Chem. 291, 6882−94. (51) Kim, K. S., Pelton, J. G., Inwood, W. B., Andersen, U., Kustu, S., and Wemmer, D. E. (2010) The Rut pathway for pyrimidine degradation: novel chemistry and toxicity problems. J. Bacteriol. 192, 4089−4102. (52) Takahashi, S., Furuya, T., Ishii, Y., Kino, K., and Kirimura, K. (2009) Characterization of a flavin reductase from a thermophilic dibenzothiophene-desulfurizing bacterium, Bacillus subtilis WU-S2B. J. Biosci Bioeng 107, 38−41. (53) Atkinson, H. J., Morris, J. H., Ferrin, T. E., and Babbitt, P. C. (2009) Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 4, e4345. (54) Gerlt, J. A. (2017) Genomic Enzymology: Web Tools for Leveraging Protein Family Sequence-Function Space and Genome Context to Discover Novel Functions. Biochemistry 56, 4293−4308. (55) Ahmed, F. 
H., Carr, P. D., Lee, B. M., Mohamed, A. E., Hong, N. S., Flanagan, J., Taylor, M. C., Greening, C., and Jackson, C. J. (2015) Sequence-Structure-Function Classification of a Catalytically Diverse Oxidoreductase Superfamily in Mycobacteria. J. Mol. Biol. 427, 3554− 3571. (56) Davidson, R., Baas, B.-J., Akiva, E., Holliday, G. L., Polacco, B. J., LeVieux, J. A., Pullara, C. R., Zhang, Y. J., Whitman, C. P., and Babbitt, P. C. (2018) A global view of structure-function relationships in the tautomerase superfamily. J. Biol. Chem. 293, 2342−2357. (57) Gerlt, J. A., Bouvier, J. T., Davidson, D. B., Imker, H. J., Sadkhin, B., Slater, D. R., and Whalen, K. L. (2015) Enzyme Function InitiativeEnzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks. Biochim. Biophys. Acta, Proteins Proteomics 1854, 1019−1037. (58) Knutson, S. T., Westwood, B. M., Leuthaeuser, J. B., Turner, B. E., Nguyendac, D., Shea, G., Kumar, K., Hayden, J. D., Harper, A. F., Brown, S. D., Morris, J. H., Ferrin, T. E., Babbitt, P. C., and Fetrow, J. S. (2017) An approach to functionally relevant clustering of the protein universe: Active site profile-based clustering of protein structures and sequences. Protein Sci. 26, 677−699. (59) Pegg, S. C. H., Brown, S. D., Ojha, S., Seffernick, J., Meng, E. C., Morris, J. H., Chang, P. J., Huang, C. C., Ferrin, T. E., and Babbitt, P. C. (2006) Leveraging Enzyme Structure−Function Relationships for Functional Inference and Experimental Design: The Structure− Function Linkage Database. Biochemistry 45, 2545−2555. (60) Akiva, E., Brown, S., Almonacid, D. E., Barber, A. E., 2nd, Custer, A. F., Hicks, M. A., Huang, C. C., Lauck, F., Mashiyama, S. T., Meng, E. C., Mischel, D., Morris, J. H., Ojha, S., Schnoes, A. M., Stryke, D., Yunes, J. M., Ferrin, T. E., Holliday, G. L., and Babbitt, P. C. (2014) The Structure-Function Linkage Database. Nucleic Acids Res. 42, D521− D530. (61) Finn, R. D., Coggill, P., Eberhardt, R. 
Y., Eddy, S. R., Mistry, J., Mitchell, A. L., Potter, S. C., Punta, M., Qureshi, M., Sangrador-Vegas, A., Salazar, G. A., Tate, J., and Bateman, A. (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279−85.

(62) Li, W., and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658−1659. (63) Chauviac, F. X., Bommer, M., Yan, J., Parkin, G., Daviter, T., Lowden, P., Raven, E. L., Thalassinos, K., and Keep, N. H. (2012) Crystal Structure of Reduced MsAcg, a Putative Nitroreductase from Mycobacterium smegmatis and a Close Homologue of Mycobacterium tuberculosis Acg. J. Biol. Chem. 287, 44372−44383. (64) Hu, Y., and Coates, A. R. (2011) Mycobacterium tuberculosis acg gene is required for growth and virulence in vivo. PLoS One 6, e20958. (65) Enright, A. J., Van Dongen, S., and Ouzounis, C. A. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575−1584. (66) Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498−2504. (67) Zenno, S., Koike, H., Kumar, A. N., Jayaraman, R., Tanokura, M., and Saigo, K. (1996) Biochemical characterization of NfsA, the Escherichia coli major nitroreductase exhibiting a high amino acid sequence homology to Frp, a Vibrio harveyi flavin oxidoreductase. J. Bacteriol. 178, 4508−4514. (68) Thomas, S. R., McTamney, P. M., Adler, J. M., Laronde-Leblanc, N., and Rokita, S. E. (2009) Crystal Structure of Iodotyrosine Deiodinase, a Novel Flavoprotein Responsible for Iodide Salvage in Thyroid Glands. J. Biol. Chem. 284, 19659−19667. (69) Fang, H., Kang, J., and Zhang, D. (2017) Microbial production of vitamin B12: a review and future perspectives. Microb. Cell Fact. 16, 15. (70) Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792− 1797. (71) Sievers, F., Wilm, A., Dineen, D., Gibson, T. 
J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D., and Higgins, D. G. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539−539. (72) Armougom, F., Moretti, S., Poirot, O., Audic, S., Dumas, P., Schaeli, B., Keduas, V., and Notredame, C. (2006) Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res. 34, W604−8. (73) Westbrook, J., Feng, Z., Jain, S., Bhat, T. N., Thanki, N., Ravichandran, V., Gilliland, G. L., Bluhm, W., Weissig, H., Greer, D. S., Bourne, P. E., and Berman, H. M. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res. 30, 245. (74) Schrödinger, L. (2010) PyMOL The PyMOL Molecular Graphics System. (75) Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., and Ferrin, T. E. (2004) UCSF Chimera-a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605−1612. (76) Biasini, M., Bienert, S., Waterhouse, A., Arnold, K., Studer, G., Schmidt, T., Kiefer, F., Gallo Cassarino, T., Bertoni, M., Bordoli, L., and Schwede, T. (2014) SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res. 42, W252−8. (77) Osipovitch, M., Lambrecht, M., Baker, C., Madha, S., Mills, J. L., Craig, P. A., and Bernstein, H. J. (2015) Automated protein motif generation in the structure-based protein function prediction tool ProMOL. J. Struct. Funct. Genomics 16, 101−111. (78) Holm, L., and Laakso, L. M. (2016) Dali server update. Nucleic Acids Res. 44, W351−5. (79) McKay, T., Hart, K., Horn, A., Kessler, H., Dodge, G., Bardhi, K., Bardhi, K., Mills, J. L., Bernstein, H. J., and Craig, P. A. (2015) Annotation of proteins of unknown function: initial enzyme results. J. Struct. Funct. Genomics 16, 43−54. (80) Yu, T. Y., Mok, K. C., Kennedy, K. 
J., Valton, J., Anderson, K. S., Walker, G. C., and Taga, M. E. (2012) Active site residues critical for flavin binding and 5,6-dimethylbenzimidazole biosynthesis in the flavin destructase enzyme BluB. Protein Sci. 21, 839−849.


(81) Phatarphekar, A., Buss, J. M., and Rokita, S. E. (2014) Iodotyrosine deiodinase: a unique flavoprotein present in organisms of diverse phyla. Mol. BioSyst. 10, 86−92. (82) Kobori, T., Sasaki, H., Lee, W. C., Zenno, S., Saigo, K., Murphy, M. E., and Tanokura, M. (2001) Structure and site-directed mutagenesis of a flavoprotein from Escherichia coli that reduces nitrocompounds: alteration of pyridine nucleotide binding by a single amino acid substitution. J. Biol. Chem. 276, 2816−2823. (83) Prosser, G. A., Copp, J. N., Syddall, S. P., Williams, E. M., Smaill, J. B., Wilson, W. R., Patterson, A. V., and Ackerley, D. F. (2010) Discovery and evaluation of Escherichia coli nitroreductases that activate the anti-cancer prodrug CB1954. Biochem. Pharmacol. 79, 678−87. (84) Sillitoe, I., Lewis, T., and Orengo, C. (2015) Using CATH-Gene3D to Analyze the Sequence, Structure, and Function of Proteins. Curr. Protoc Bioinformatics 50, 1.28.1−1.28.21. (85) Barber, A. E., and Babbitt, P. C. (2012) Pythoscape: a framework for generation of large protein similarity networks. Bioinformatics 28, 2845−2846. (86) Tian, B.-X., Wallrapp, F. H., Holiday, G. L., Chow, J. Y., Babbitt, P. C., Poulter, C. D., and Jacobson, M. P. (2014) Predicting the functions and specificity of triterpenoid synthases: a mechanism-based multi-intermediate docking approach. PLoS Comput. Biol. 10, e1003874. (87) Zhang, X., Carter, M. S., Vetting, M. W., San Francisco, B., Zhao, S., Al-Obaidi, N. F., Solbiati, J. O., Thiaville, J. J., de Crécy-Lagard, V., Jacobson, M. P., Almo, S. C., and Gerlt, J. A. (2016) Assignment of function to a domain of unknown function: DUF1537 is a new kinase family in catabolic pathways for acid sugars. Proc. Natl. Acad. Sci. U. S. A. 113, E4161−9. (88) Zhao, S., Kumar, R., Sakai, A., Vetting, M. W., Wood, B. M., Brown, S., Bonanno, J. B., Hillerich, B. S., Seidel, R. D., Babbitt, P. C., Almo, S. C., Sweedler, J. V., Gerlt, J. A., Cronan, J. E., and Jacobson, M. P. 
(2013) Discovery of new enzymes and metabolic pathways by using structure and genome context. Nature 502, 698−702. (89) Levin, B. J., Huang, Y. Y., Peck, S. C., Wei, Y., Martínez-Del Campo, A., Marks, J. A., Franzosa, E. A., Huttenhower, C., and Balskus, E. P. (2017) A prominent glycyl radical enzyme in human gut microbiomes metabolizes trans-4-hydroxy-l-proline. Science 355, eaai8386. (90) Babbitt, P. C., and Gerlt, J. A. (1997) Understanding enzyme superfamilies. Chemistry As the fundamental determinant in the evolution of new catalytic activities. J. Biol. Chem. 272, 30591−30594. (91) Yang, G., Hong, N., Baier, F., Jackson, C. J., and Tokuriki, N. (2016) Conformational Tinkering Drives Evolution of a Promiscuous Activity through Indirect Mutational Effects. Biochemistry 55, 4583−4593. (92) Natarajan, C., Jendroszek, A., Kumar, A., Weber, R. E., Tame, J. R. H., Fago, A., and Storz, J. F. (2018) Molecular basis of hemoglobin adaptation in the high-flying bar-headed goose. PLoS Genet. 14, e1007331. (93) Sugrue, E., Scott, C., and Jackson, C. J. (2017) Constrained evolution of a bispecific enzyme: lessons for biocatalyst design. Org. Biomol. Chem. 15, 937−946. (94) Scott, C., Jackson, C. J., Coppin, C. W., Mourant, R. G., Hilton, M. E., Sutherland, T. D., Russell, R. J., and Oakeshott, J. G. (2009) Catalytic improvement and evolution of atrazine chlorohydrolase. Appl. Environ. Microbiol. 75, 2184−2191. (95) Joerger, A. C., Mayer, S., and Fersht, A. R. (2003) Mimicking natural evolution in vitro: an N-acetylneuraminate lyase mutant with an increased dihydrodipicolinate synthase activity. Proc. Natl. Acad. Sci. U. S. A. 100, 5694−5699. (96) Schmidt, D. M. Z., Mundorff, E. C., Dojka, M., Bermudez, E., Ness, J. E., Govindarajan, S., Babbitt, P. C., Minshull, J., and Gerlt, J. A. (2003) Evolutionary Potential of (β/α)8-Barrels: Functional Promiscuity Produced by Single Substitutions in the Enolase Superfamily. Biochemistry 42, 8387−8393.

(97) McLoughlin, S. Y., and Copley, S. D. (2008) A compromise required by gene sharing enables survival: Implications for evolution of new enzyme activities. Proc. Natl. Acad. Sci. U. S. A. 105, 13497−13502. (98) Copley, S. D. (2009) Evolution of efficient pathways for degradation of anthropogenic chemicals. Nat. Chem. Biol. 5, 559−566. (99) Afriat-Jurnou, L., Jackson, C. J., and Tawfik, D. S. (2012) Reconstructing a missing link in the evolution of a recently diverged phosphotriesterase by active-site loop remodeling. Biochemistry 51, 6047−6055. (100) Plesa, C., Sidore, A. M., Lubock, N. B., Zhang, D., and Kosuri, S. (2018) Multiplexed gene synthesis in emulsions for exploring protein functional landscapes. Science 359, 343−347. (101) Bastard, K., Smith, A. A. T., Vergne-Vaxelaire, C., Perret, A., Zaparucha, A., De Melo-Minardi, R., Mariage, A., Boutard, M., Debard, A., Lechaplais, C., Pelle, C., Pellouin, V., Perchat, N., Petit, J. L., Kreimeyer, A., Medigue, C., Weissenbach, J., Artiguenave, F., De Berardinis, V., Vallenet, D., and Salanoubat, M. (2014) Revealing the hidden functional diversity of an enzyme family. Nat. Chem. Biol. 10, 42−49. (102) Huang, H., Pandya, C., Liu, C., Al-Obaidi, N. F., Wang, M., Zheng, L., Toews Keating, S., Aono, M., Love, J. D., Evans, B., Seidel, R. D., Hillerich, B. S., Garforth, S. J., Almo, S. C., Mariano, P. S., Dunaway-Mariano, D., Allen, K. N., and Farelli, J. D. (2015) Panoramic view of a superfamily of phosphatases through substrate profiling. Proc. Natl. Acad. Sci. U. S. A. 112, E1974−83. (103) Vetting, M. W., Al-Obaidi, N., Zhao, S., Kim, J., Wichelecki, D. J., Bouvier, J. T., Solbiati, J. O., Vu, H., Zhang, X., Rodionov, D. A., Love, J. D., Hillerich, B. S., Seidel, R. D., Quinn, R. J., Osterman, A. L., Cronan, J. E., Jacobson, M. P., Gerlt, J. A., and Almo, S. C. 
(2015) Experimental strategies for functional annotation and metabolism discovery: targeted screening of solute binding proteins and unbiased panning of metabolomes. Biochemistry 54, 909−931. (104) Ehrhardt, M. K. G., Warring, S. L., and Gerth, M. L. (2018) Screening Chemoreceptor-Ligand Interactions by High-Throughput Thermal-Shift Assays. Methods Mol. Biol. 1729, 281−290. (105) Calhoun, S., Korczynska, M., Wichelecki, D. J., San Francisco, B., Zhao, S., Rodionov, D. A., Vetting, M. W., Al-Obaidi, N. F., Lin, H., O’Meara, M. J., Scott, D. A., Morris, J. H., Russel, D., Almo, S. C., Osterman, A. L., Gerlt, J. A., Jacobson, M. P., Shoichet, B. K., and Sali, A. (2018) Prediction of enzymatic pathways by integrative pathway mapping. eLife 7, 683. (106) Baier, F., Chen, J., Solomonson, M., Strynadka, N. C., and Tokuriki, N. (2015) Distinct Metal Isoforms Underlie Promiscuous Activity Profiles of Metalloenzymes. ACS Chem. Biol. 10, 1684−1693. (107) Vanacek, P., Sebestova, E., Babkova, P., Bidmanova, S., Daniel, L., Dvorak, P., Stepankova, V., Chaloupkova, R., Brezovsky, J., Prokop, Z., and Damborsky, J. (2018) Exploration of Enzyme Diversity by Integrating Bioinformatics with Expression Analysis and Biochemical Characterization. ACS Catal. 8, 2402−2412. (108) Bochner, B. R., Gadzinski, P., and Panomitros, E. (2001) Phenotype microarrays for high-throughput phenotypic testing and assay of gene function. Genome Res. 11, 1246−1255. (109) Fuhrer, T., and Zamboni, N. (2015) High-throughput discovery metabolomics. Curr. Opin. Biotechnol. 31, 73−78. (110) Reimer, L. C., Spura, J., Schmidt-Hohagen, K., and Schomburg, D. (2014) High-throughput screening of a Corynebacterium glutamicum mutant library on genomic and metabolic level. PLoS One 9, e86799. (111) Ovchinnikov, S., Park, H., Varghese, N., Huang, P. S., Pavlopoulos, G. A., Kim, D. E., Kamisetty, H., Kyrpides, N. C., and Baker, D. (2017) Protein structure determination using metagenome sequence data. Science 355, 294−298. 
(112) Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P., Springer, M., Sander, C., and Marks, D. S. (2017) Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128−135.

