Can Small Laboratories Do Structural Genomics? - Crystal Growth

Publication Date (Web): November 7, 2007.
CRYSTAL GROWTH & DESIGN

Can Small Laboratories Do Structural Genomics?† Ronny C. Hughes and Joseph D. Ng* Laboratory for Structural Biology, Department of Biological Sciences, University of Alabama in Huntsville, Huntsville, Alabama 35899

2007 VOL. 7, NO. 11 2226–2238

ABSTRACT: Structural genomics centers have often been regarded as “members only” clubs associated with expensive equipment, material, and personnel. Although much has been invested worldwide in structural genomics consortia, there has been an enormous amount of criticism of the apparently disproportionate output relative to the amount of money spent. Here, we review and highlight some of the major achievements in structural genomics projects coupled to X-ray crystallography. Structural genomics has advanced technologies in high-throughput bioinformatics, cloning and recombinant protein expression, crystal growth, and crystallographic structure determination. If cost can be reduced and feasibility increased, then a modest-sized laboratory research group can contribute significantly to structural genomics efforts. We demonstrate this by examining a sequenced and annotated hyperthermophilic archaeal genome (about 2 million base pairs) with our own structural genomics methods. By combining proven strategies developed at large structural genomics centers with practical innovations, we have constructed a mini-pipeline in which a small group consisting of as few as two people can survey 1500 open reading frames for cloning, expression, crystallization, and structure determination for less than $200 000.

Introduction

Structural genomics has been associated with factory production-line processes that decipher the three-dimensional structures of all tractable macromolecules encoded by complete genomes. Large centers worldwide have participated in this endeavor, utilizing NMR spectroscopy and X-ray crystallography in the hope of yielding structure–function information on thousands of proteins.
Structural genomics (SG) centers contribute approximately 20% of the new structures (one out of every five) going into the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), and half of the new structurally characterized protein families reported to date are a result of SG efforts.1 In the course of this achievement, advanced technology has also been developed in high throughput gene cloning,2,3 recombinant protein production,4 macromolecular crystallization,5 NMR spectroscopy,6 and X-ray crystallography.7 Despite the successful outcomes so far, and with many more genome sequences becoming available, a large number of predicted protein families still lack a structural representative in the PDB. In other words, there is not yet a three-dimensional protein structure that represents every class of protein function revealed by global or local sequence analysis or associated with unique or novel biological pathways (for further information on protein family classifications, see refs 8 and 9). There is still a great demand to solve thousands more protein structures. It has been argued that only the large structural genomics centers can afford to continue to pursue the remaining unsolved structures because of their access to financial resources and manpower. Consequently, the large centers performing structural genomics have been viewed as so exclusive that small conventional laboratory groups are not able to participate in and contribute to structural genomics. As most worldwide structural genomics programs are in the midst of their second-phase continuation, we bring to attention technologies and applications acquired from large structural
† Part of the special issue (Vol 7, issue 11) on the 11th International Conference on the Crystallization of Biological Macromolecules, Québec, Canada, August 16–21, 2006 (preconference August 13–16, 2006).
* Corresponding author.
E-mail: [email protected]

genomic center achievements in which modest-sized research teams can now perform structural genomic work without extravagant budget requirements. Because X-ray crystallography has been the principal method to determine three-dimensional protein structures, especially those that are greater than 35 kD, we will focus our discussion on doing structural genomics associated with X-ray crystallography. Impact of Structural Genomic Centers and Their Current Operations In the Monty Python classic comedy film “The Life of Brian,” it was asked “what have the Romans ever done for us—besides the aqueduct, sanitation, roads, irrigation, medicine, education, public health, etc.”. Similarly, we can also ask what have structural genomics centers ever done for us besides contributing one-fifth of the new structures going into the Protein Data Bank, fabricating automation for high throughput protein expression and crystallization setups, fast and efficient pipeline crystallography, lowering the average cost of structure determination, and others. Although there has been considerable criticism on the novelty, cost, and impact of structures solved by structural genomics centers, we highlight some significant achievements that have contributed to important technology and consequently advanced the field of structural biology. Other contributions are also listed in Table 1, highlighting examples of structural genomics centers along with their objectives and accomplishments. The list is not comprehensive but only exemplifies SG centers that have deposited, to the best of our knowledge and according to the RCSB PDB, 10 structures or more in the protein database as of June 6, 2007. Bioinformatics. Structural genomics begins with the nucleotide sequence of an entire genome. The initial task is to computationally identify gene sequences that correspond to coding regions targeted for recombinant protein expression. 
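The initial task described above, locating candidate coding regions in a genome sequence, can be illustrated with a deliberately naive sketch. It is not the method used by any of the centers or by the authors: it assumes ATG as the only start codon and the three standard stop codons, and it ignores the statistical models that real gene finders rely on. The function names and the `min_codons` threshold are hypothetical choices made for illustration.

```python
# Naive open reading frame (ORF) scan over both strands of a DNA sequence.
# Assumptions (not from the article): ATG-only starts, standard stop codons,
# and a simple length cutoff expressed in codons.

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return a list of (strand, start, end, orf_sequence) tuples.

    Coordinates are 0-based on the scanned strand and `end` is exclusive.
    `min_codons` counts codons from the ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # resume scanning after the stop codon
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    for orf in find_orfs(toy, min_codons=3):
        print(orf)
```

Each ORF reported this way is only a candidate target; the annotation and family-classification tools discussed in this section would still be needed to decide which candidates are worth cloning.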
Bioinformatics tools are required to manage and archive information important for protein family classifications, template design for cloning primers, understanding evolutionary relationships, and gene product analyses.10 Structural genomics centers have made significant contributions in developing high throughput software that has accelerated this first step of going from gene to structure at the bioinformatics level. Although there have

10.1021/cg700706a CCC: $37.00  2007 American Chemical Society Published on Web 11/07/2007

Reviews

Table 1. Contributions of SG Centers

center | targeted organism(s) | objectives | no. of structures deposited in PDB (a) | technology contributions
Berkeley Structural Genomics Center (BSGC), USA | Mycoplasma genitalium and Mycoplasma pneumoniae | establish structural representation of pathogenic genomes | 92 | column protein refolding and robotic automation of cloning
Center for Eukaryotic Structural Genomics (CESG), USA | Arabidopsis thaliana | obtain structures of Arabidopsis thaliana proteins and eukaryotic targets | 100 | high throughput production (HTP) of eukaryotic proteins, including cell-free expression
The Joint Center for Structural Genomics (JCSG), USA | human, C. elegans, Thermotoga maritima | help gain a structural understanding of the central machinery for life | 438 | automation of cloning and protein production and tools for remote data collection
The Midwest Center for Structural Genomics (MCSG), USA | selected eukaryotic and prokaryotic targets | solve fundamental protein structures from all three domains of life | 632 | tools for HTP cloning and protein purification
New York Structural Genomics Research Consortium (NYSGRC), USA | selected human targets and model organisms | solve structures of biological importance from model organisms and human | 388 | experimental and computational tools for SG
Northeast Structural Genomics Consortium (NESGC), USA | selected human targets and model organisms | solve structures from large underrepresented protein domain families | 392 | robotic cloning and NMR techniques for SG
Structural Genomics of Pathogenic Protozoa Consortium (SGPP), USA | pathogenic protozoa | solve structures of pathogenic proteins of importance to human health | 41 | HTP plating and colony isolation and novel coexpression methods
TB Structural Genomics Consortium (TBSGC), USA | Mycobacterium tuberculosis | solve structures of potential TB drug targets | 119 | tools and software for remote and automated data collection
The Southeast Collaboratory for Structural Genomics (SECSG), USA | C. elegans, Pyrococcus furiosus, human | solve structures from target organisms and selected health-related human proteins | 92 | pipeline crystallography and remote data collection
Structure 2 Function Project (S2F), USA | Haemophilus influenzae | solve the structure of “hypothetical” or functionally uncharacterized proteins | 49 | cost minimization methods for SG
Riken (RSGI), Japan | mouse, A. thaliana, and Thermus thermophilus | solve structures from target organisms that are biologically important or human health related | 2497 | cell-free expression of proteins
Structural Genomics in Europe and England (SPINE), Europe | human and human-health-related pathogens | solve structures of proteins relevant to human health | 97 | nanoliter crystallization and imaging
The Israel Structural Proteomics Center (ISPC), Israel | human, and specific eukaryotic and prokaryotic targets | solve structures of proteins related to human health | 14 | methods in HTP protein production, including human protein
Montreal-Kingston Bacterial Structural Genomics Initiative (BSGI), Canada | E. coli and specific targets | solve structures of proteins involved in small molecule metabolism, and proteins that are involved in pathogenesis | 47 | web-based data management systems for SG
Mycobacterium tuberculosis Structural Proteomics Project (XMTB), Germany | Mycobacterium tuberculosis | use of structure-based drug discovery to combat TB | 15 | integration of drug discovery and SG
Structural Genomics Consortium (SGC), England, Canada, Sweden | targets from human and human pathogens | solve structures of proteins relevant to human health and disease | 210 | chemical screening methods to facilitate protein stability and crystallization

(a) Structural genomics centers that have deposited 10 structures or more in the RCSB PDB as of June 6, 2007.

been numerous advances made in bioinformatics techniques and program packages available for rapid computation, the following examples represent some major contributions that have facilitated the field of structural biology. The most useful and least acknowledged implementation in structural genomics is the graphical user interface (GUI) for bioinformatics programs. The ability to interact with a computer program in a graphical mode instead of a character-based mode existed long before the structural genomics era. However, coupling user-friendly GUIs with

complex gene analysis programs has eased the effort of data entry and made output easier to read. As a result, gene sequences can be easily loaded and subsequent bioinformatics manipulations can be executed without using a cumbersome and enigmatic command-line interface. Structural genomics has refined and advanced new or preexisting prediction tools for nucleic acid and protein sequences. The basic local alignment search tool (BLAST) has been one of the most useful tools for finding regions of local similarity between nucleic acid and protein sequences. It has been


beneficial in linking functional and evolutionary relationships between sequences as well as identifying gene family members. The creation and advancement of comprehensive databases have made intelligent target selection of genomic sequences easily attainable. Some examples include the NCBI, SwissProt/TrEMBL,11 Comprehensive Microbial Resources (CMR),12 GeneCards (http://www.genecards.org/cgi-bin/carddisp.pl?gene=AGTR1), and the Human Proteome Reference Database and Mammalian Gene Collection (MGC).13 Predicting open reading frames and their subsequent protein sequences would have been equally daunting without the availability of SignalP (http://www.cbs.dtu.dk/services/SignalP/) for signal peptides; TMHMM (http://www.cbs.dtu.dk/services/TMHMM/) and HMMTOP (http://www.enzim.hu/hmmtop/) for membrane proteins; and PSORT (http://psort.nibb.ac.jp/) for cellular location. Putative protein sequences must be sorted into functional categories, for which programs such as Pfam (http://www.sanger.ac.uk/Software/Pfam/), COG,14 and TIGRFAMs (http://www.tigr.org/TIGRFAMs/) have been indispensable for protein family assignments. Once protein sequences are known, identifying primary structures that correspond to common folding clusters or that are predicted to be disordered is important for cloning strategies. Programs that have greatly advanced this effort include disEMBL (http://dis.embl.de/) for disordered region predictions; PRODOM, SMART,15 3Dee (http://www.compbio.dundee.ac.uk/3Dee/), and CCD16 for domain assignment; and InterProScan,17,18 PRINTS,19 and PROSITE (http://ca.expasy.org/prosite/) for integrated sequence motif recognition. Many proteins are not stable by themselves but require multipart interactions with other molecules. The task of predicting protein interactions is even more complex and far reaching in terms of finding gene products that can be potential molecular partners for subsequent crystallization and structure determination.
Databases such as DIP20 and STRING21 have provided great insight in targeting appropriate molecular associates that will give rise to highly probable multi-subunit complexes. Finally, databases of three-dimensional macromolecular structures coupled to programs that allow protein topological comparisons have accelerated the identification of protein families and, as a result, have influenced the target priorities. The RCSB Protein Data Bank is the most prominent collection of three-dimensional coordinates of macromolecules, and their classification is further cataloged in the databases SCOP,22 CATH,23 FSSP,24 and the Enzyme Structure Database (http://www.biochem.ucl.ac.uk/bsm/enzymes). Programs such as DALI (http://www.ebi.ac.uk/dali/) and CE (http://bioinformatics.albany.edu/~cemc/) use three-dimensional coordinates to search for and compare structural homologies among known structures. Structures determined by structural genomics have been exploited to build structure modeling tools based on homology modeling, such as Modeller (http://salilab.org/modeller/), SwissModel (http://swissmodel.expasy.org//SWISS-MODEL.html), and Metaserver (http://bioinfo.pl). Although the foundations of many of the databases and programs existed prior to the structural genomics period, their build-up and the increased availability of user-friendly tools for easy interface accessibility would not have been possible without the demands and execution of structural genomics.

PCR Cloning and Protein Production. When structural genomics consortia were forming and the ambitious plan of going from gene to structure in a high throughput manner was


proposed, the difficulty of protein production was considerably underestimated. Initially, the rate-determining steps were projected to be at the crystallographic level. Most had contended that robust X-ray data collection devices and sophisticated computational methods for X-ray crystallography were most important for development and would be the bottleneck step for protein structural determination. As the result, protein crystallographers were projected to be the highest in demand. On the contrary, after a couple of years into many structural genomics programs, the demand for protein crystallographers was eclipsed by the need for molecular biologists and protein biochemists involved in protein production. Surprisingly, the cloning and protein expression steps were problematic, and it was discovered rather quickly that most recombinant proteins do not express well and are difficult to purify. Consequently, fewer proteins than expected were made available for crystallization trials and those that crystallized were not suitable for X-ray crystallography. So for an extended amount of time, crystallographers were underused while the technology for recombinant protein production was being further developed. As a result, high throughput cloning and recombinant expression tools were generated at an accelerated pace, whereas they would not have been otherwise developed as quickly without the existence of structural genomic programs. Structural genomics facilities and related commercialized endeavors have fabricated automated platforms for executing the essential molecular biology steps required to subclone open reading frames (ORFs) directly from genomic DNAs and/or cDNA libraries. The general task was to insert the targeted coding sequences into expression vectors, undergo transformation in a high expression host (i.e., E. coli), and screen for their ability to express the target protein in a soluble form. 
One of the most common vector systems used was the T7 RNA polymerase-dependent E. coli expression vector system (pET vectors)25 (Novagen, Madison, WI). Structural genomics centers and industry have engineered many variations of prokaryotic expression plasmids to accommodate recombinant expression in different organisms. These efforts have led to very useful expression plasmids such as pET100/D-TOPO (Invitrogen, Carlsbad, CA), pSUMO,26 and many other plasmid constructs. Most of the expression vectors used incorporated an N- or C-terminal tag such as His6 to facilitate subsequent purification on a metal affinity chromatography column charged with Ni2+ or Co2+. All the steps required to go from PCR primers to transformed host cells (e.g., BL21(DE3)) were conducted with robotic manipulations in 96-well format coupled to bar code tracking of samples and reagents. The NYSGRC and SECSG consortia, for example, used the Beckman Biomek FX robot, whereas the Protein Structure Factory used the Zinsser Speedy pipetting robot. Even though many procedures have been described as fully automated, there is still a large degree of offline manipulation using manual multichannel pipetting in small-scale purification, such as that performed with the Millipore Ni2+ Metal Chelating Zip Tips. Small-scale purification is usually confirmed by SDS-PAGE analysis and/or mass spectrometry (e.g., MALDI-MS). The entire process can be comfortably conducted by 2 trained people testing 96 clones within 2 weeks, with success rates generally ranging from 25 to 50% with respect to the number of robotically engineered expression clones yielding soluble proteins with correct apparent masses. The lower range usually represents high-throughput expression screens with eukaryotes such as C. elegans, and the higher range corresponds to bacteria and archaea such as M. pneumoniae and P. furiosus, respectively.
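The success rates quoted above translate directly into an expected yield of soluble clones per screening round. The short sketch below does that arithmetic for one 96-well plate and, as an extrapolation that the text itself does not make, for a 1500-target survey; applying the centers' 25 to 50% range to a small-laboratory survey is an assumption, and the function name is hypothetical.

```python
# Expected number of soluble expression clones for a given number of targets,
# using the 25-50% success range quoted in the text.
# Assumption: the same range applies uniformly to every target screened.

def soluble_range(n_clones, low_rate=0.25, high_rate=0.50):
    """Return (low, high) expected counts of clones yielding soluble protein."""
    return (n_clones * low_rate, n_clones * high_rate)


if __name__ == "__main__":
    for n in (96, 1500):
        low, high = soluble_range(n)
        print(f"{n} clones: {low:.0f} to {high:.0f} expected soluble")
```

Under these assumptions a single 96-well round would be expected to deliver between 24 and 48 soluble proteins for crystallization trials.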


The most noteworthy automated large scale expression system was the high-throughput bacteria fermentation system fabricated by The Scripps Research Institute in collaboration with the Genomics Institute of the Novartis Research Foundation. This system allowed high-throughput large-scale protein expression in E. coli to occur in 96 tubes at one time, allowing an average of 7 mg of recombinant protein to be produced per tube for evaluation (U.S. Published Patent Application 2002/0146818). The GATEWAY system (Invitrogen Life Technologies, Carlsbad, CA, USA) was successfully used for expressing eukaryotic gene for structural genomics.27 This system was very effective in using clones from C. elegans to perform two-hybrid analyses to screen for protein partners in different biological processes. Full length clones tagged with maltose binding protein (MBP) or hexahistidine have been the most commonly assayed for expression and solubility in E. coli. The greatest strength of this system was that it allowed in vitro recombination reactions to occur allowing rapid transfer of eukaryotic open reading frames into multiple destination vectors. Consequently, the number of types of expression constructs that can be examined for expression and solubility can be maximized. In the case of C. elegans, structural genomic centers have reported great success in expressing eukaryotic genes in bacterial systems. The manipulations of the cloning procedures were performed either manually or by robotic means (e.g., BioRobot 9600 (Qiagen, Valencia, CA, USA)). The recombinant expression vectors exploiting the metabolically induced trc promoter were also constructed and well-used in structural genomics for high-throughput protein production of human and other eukaryotic proteins (e.g., pTrcHis (Invitrogen, Carlsbad, CA)). The expression system does not require culture monitoring, induction, and troublesome refolding procedures. Protein Crystallization. 
Protein crystallization has been and will continue to be the bottleneck step in structure determination for macromolecular crystallography. The most common methods for protein crystallization used in structural genomics are batch and vapor diffusion equilibration of purified protein screened against different precipitating solutions. The approach has been to quickly explore as many different precipitants as possible, statistically sampling crystallization space at random points by factorial28 or sparse matrix29 screens. One of the earliest contributions, and one of the most significant for structural genomics, is the availability of commercialized screening kits. In the early 1990s, Hampton Research was the first to make available prepared precipitating reagents based on sparse matrix searches, along with other crystallization tools, which allowed a broad range of biological disciplines to crystallize proteins. Thereafter, an explosion of three-dimensional protein structures was determined and deposited into the Protein Data Bank. Other companies followed suit years later, such that a variety of commercialized kits and apparatuses are now at the disposal of the structural biology community. Two important criteria were demanded during the course of structural genomics protein crystallization: minimal time for crystallization setups and minimal usage of protein material. Both of these requirements were answered by the fabrication of robotic automation and micro- and nanovolume crystallization screening. As a result, the crystallization volume has decreased to as little as 10 nL, whereby a fraction of a milligram of protein is sufficient to screen against 96 crystallization conditions.30–32 Robotic systems can now prepare hundreds of samples per day for fast and efficient screening (e.g., IMPAX and ORYX (Douglas Instruments Ltd., East


Garston, U.K.), mosquito (TTP LabTech, Melbourn, U.K.), Cartesian, Honeybee (Genomic Solutions, Ann Arbor, MI)). In recent years, nonconvective crystallization has also been prepared in a microfluidic or capillary format where crystals can be analyzed in situ without invasive manipulations.33,34 In this respect, the counterdiffusion method has been the most useful and cost efficient for optimizing known crystallization conditions in restricted geometry where crystals are obtained in a supersaturated gradient.35,36 The techniques described here have proven to be invaluable in high-throughput mode, and the structural genomics programs have provided opportunities to fabricate widely accessible, cost-effective, and user-friendly instrumentation for the general structural biology community. As hundreds of crystallization trials can be prepared quickly and efficiently, the challenge of imaging and evaluating crystal quality became evident. The quality of protein crystals has been evaluated by different criteria. Visual inspection is initially performed to determine the habit and size of the crystals. Even though this is the least reliable way to determine whether a crystal is suitable for X-ray diffraction, it allows the experimenter to decide which crystals should be further analyzed. Crystal quality is mostly determined by how well the crystal diffracts X-rays in terms of its diffraction intensity over background noise (I/σ), resolution limit, and mosaicity. In some cases, one can even go as far as measuring the phasing power of intrinsic atoms that provide anomalous diffraction signals.37 In practice, crystals are mounted and their quality is determined directly by X-ray diffraction, with a diffraction limit of 3 Å or better as the criterion (based on I/σ and the ability to index diffraction spots).
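The practical criterion just stated (a diffraction limit of 3 Å or better and indexable spots) can be written down as a small bookkeeping rule for sorting test exposures. The sketch below is only that: the record fields, names, and default cutoff are hypothetical, the resolution limit is assumed to have been estimated beforehand from I/σ, and the rule is an aid to, not a replacement for, the experimenter's judgment.

```python
# Bookkeeping sketch for ranking crystals after a test exposure.
# Assumptions (not from the article): the field names, and that the resolution
# limit has already been estimated from I/sigma by the data processing software.

from dataclasses import dataclass


@dataclass
class DiffractionTest:
    crystal_id: str
    resolution_limit_A: float  # smaller is better (e.g., 2.1 means 2.1 angstroms)
    indexable: bool            # True if the diffraction spots could be indexed


def worth_full_collection(test, cutoff_A=3.0):
    """Apply the rule of thumb: indexable and diffracting to the cutoff or better."""
    return test.indexable and test.resolution_limit_A <= cutoff_A


def rank_candidates(tests, cutoff_A=3.0):
    """Return the passing crystals, best resolution first."""
    passing = [t for t in tests if worth_full_collection(t, cutoff_A)]
    return sorted(passing, key=lambda t: t.resolution_limit_A)


if __name__ == "__main__":
    screened = [
        DiffractionTest("xtal-1", 2.1, True),
        DiffractionTest("xtal-2", 3.5, True),
        DiffractionTest("xtal-3", 1.8, False),
    ]
    for t in rank_candidates(screened):
        print(t.crystal_id, t.resolution_limit_A)
```

Mosaicity and anomalous signal, which the text also lists as quality measures, are left out here to keep the rule minimal.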
There are no automated methods to evaluate crystal quality, as the experimenter must decide on a case-by-case basis which crystals should be abandoned and which should undergo complete data collection, analysis, and structure determination. Storage chambers for high-throughput visualization have been made commercially available to record high-resolution images of crystallization trials quickly and systematically. Examples of these apparatuses include the Crystal Farm (Bruker AXS, Karlsruhe, Germany) and Rock Imager (Formulatrix Inc., Waltham, MA), where hundreds of crystallization trays can be stored under temperature control and programmed for optical visualization and recording of each crystallization droplet as a function of time. Structural genomics has inspired two methods for quick crystal analysis. The first is crystallization by high-throughput counterdiffusion in a capillary.33,34 Protein crystals can be grown by allowing a concentrated protein volume to diffuse against a precipitant solution such that a supersaturation gradient is created along the length of the capillary.38 Protein crystals can grow along the length of the capillary such that the crystal quality is a function of distance from the protein–precipitant interface. The capillary chamber can be placed directly in the X-ray beam and each crystal in the capillary can be diffracted in situ without physical manipulation. The best crystals can be immediately identified by examining the diffraction quality as described above. The second method uses a recently fabricated commercial instrument (PX Scanner) by Oxford Diffraction (Abingdon, U.K.). This instrument combines optical inspection and in situ X-ray diffraction of protein crystals grown in multiwell plates. This single compact instrument provides a means to obtain information on the diffraction quality of crystals prior to mounting.

X-ray Data Collection.
Almost all protein crystallographic structures are determined from diffraction data collected from synchrotron X-ray radiation. There are currently more than 14


active synchrotron facilities and 80 beamlines in the world (http://biosync.sdsc.edu/international.html) where structural genomics data are collected. The synchrotron X-ray beam is very fine, intense, and focused compared to that of the home laboratory source such that crystals as small as 10 µm can provide adequate X-ray data for structure determination. Because of the high intensity and energy of synchrotron X-ray sources, the diffracting crystals must be analyzed under cryogenic conditions (temperatures of less than 100 K). The rate-limiting step between successful crystal growth and crystallographic data computation of a three-dimensional model is mounting the crystal onto an amorphous holder such as a nylon or plastic loop. Presently, there is no automated method to select and carefully transfer a protein crystal from the crystallization solution onto a crystal loop. In cases where protein crystals are not produced in a cryotolerant solution, one must still manually transfer the crystals from the mother liquor and soak them in a proper cryogenic solution prior to their loop mount. Once protein crystals are mounted and stored in liquid nitrogen, there are robotic systems that will automatically mount prepared crystals, center them in the line of the X-ray beam, perform initial diffraction analysis, and allow the user to determine if the crystals are deemed useful for complete data collection. The Stanford Synchrotron Radiation Laboratory (SSRL) automated mounting (SAM) system is an example of a very effective robotic crystal mounter. It uses a high-capacity cassette capable of storing 96 copper pins in a liquid nitrogen dewar. The cassette can be positioned with a robotic retriever such that each pin can be taken out and mechanically placed and centered in front of the X-ray beam under cryogenic conditions. 
Adjustments of the crystals and subsequent data collection are controlled by beamline software coupled to a screening system database with a web-based interface. Data collection can thus be performed on site or remotely in most cases. Similar systems have been constructed at other synchrotron sources where crystals preloaded in pucks or canisters are mounted by robotic means and crystal screening followed by data collection is performed by automated methods.39,40

Pipeline Crystallography. The field of structural biology has advanced tremendously in terms of how long it takes to produce a crystallographic structure. Compared to the 22 years required to solve the hemoglobin structure,41 the three-dimensional structure of a protein can now be obtained within a few hours starting from a well-diffracting crystal. Here, we point out some examples of computational development inspired by or assembled from structural genomics that link high-speed crystallographic packages to automatically carry processed intensity data through to electron density calculations. Program packages such as SOLVE/RESOLVE42 have been instrumental for structure solution, phase improvement, and model tracing. Similarly, the ARP/wARP distribution43 combines automated model building with REFMAC44 restrained refinement. One of the first high-throughput and robust crystallographic pipelines was developed by the Southeast Collaboratory for Structural Genomics (SECSG), in which these programs were linked together to perform efficient ab initio phasing or molecular replacement. The SOLVE/RESOLVE combination was implemented with ARP/wARP as well as with ISAS45 and DM.46 Two molecular replacement packages, AMoRe47 and PHASER,48 have also been coupled together for automation. This pipeline was termed SCA2STRUCTURE49 and allowed the submission of processed and scaled data sets to produce electron density maps traced with an initial model by uninterrupted crystallographic calculations (http://www.secsg.org/methods.htm). The first application of the SCA2STRUCTURE pipeline produced the crystal structure of a protein target from the Pyrococcus furiosus genome within 4.5 h. The structure was solved using anomalous diffraction data collected at the SERCAT synchrotron beamline, Argonne Advanced Photon Source (Chicago, IL). The development of the SCA2STRUCTURE pipeline led to the design of SGXPro,50 which acts as a workflow engine for managing communication automatically between different crystallographic processes. The program package allows the user to choose from a palette of crystallographic programs in a combinatorial manner, including 3DSCALE, SHELXD, ISAS, SOLVE/RESOLVE, DM, SOLOMON, DMMULTI, BLAST, AMoRe, EPMR, XTALVIEW, ARP/wARP, and MAID. Execution is performed through a user-friendly interface in a plug-and-play design allowing program integration at the user’s specification.

Overall Cost and Impact. The average annual budget for a structural genomics center is about $5 million, not accounting for capital equipment expenses and synchrotron beam time. Since structural genomics projects began, the price per structure has averaged $200 000, and more recently, the lowest cost was reported to be less than $100 000.1 In comparison, the cost per structure solved by traditional methods was estimated to be as high as $300 000. The average cost for each PDB submission by structural genomics centers was about 50% lower than that for traditional laboratories, rendering structure determination overall more cost effective for structural genomics. If only novel structures were considered, especially those placed in a new Pfam family or SCOP superfamily, then the price per structure would be considerably higher. Nonetheless, the cost of novel structures contributed by structural genomics is still on average about 20% lower than that of traditional methods.
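The cost figures quoted in this paragraph can be compared with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the function names are hypothetical, and dividing the annual budget by the per-structure price is a naive assumption, since the text does not say that the per-structure price was derived that way and the budget figure excludes capital equipment and beam time.

```python
# Arithmetic on the cost figures quoted in the text (all values in US dollars).

def relative_saving(cost, reference_cost):
    """Fraction by which `cost` is lower than `reference_cost`."""
    return (reference_cost - cost) / reference_cost


def structures_per_year(annual_budget, cost_per_structure):
    """Naive throughput estimate: budget divided by price per structure."""
    return annual_budget / cost_per_structure


if __name__ == "__main__":
    traditional = 300_000   # upper estimate for traditional methods
    sg_average = 200_000    # average SG price per structure
    sg_lowest = 100_000     # lowest reported SG price (stated as "less than")

    print(f"average SG saving: {relative_saving(sg_average, traditional):.0%}")
    print(f"lowest-cost SG saving: {relative_saving(sg_lowest, traditional):.0%}")
    print(f"structures per year at $5M: {structures_per_year(5_000_000, sg_average):.0f}")
```

With these inputs the average SG price is one-third lower than the $300 000 upper estimate and the lowest reported price is two-thirds lower, so the "about 50% lower" figure quoted in the text falls between the two bounds.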
In the first phase of structural genomics, many resources were invested into advancing technology in computation, protein expression, crystallization, and X-ray crystallography, as previously described. These advancements have not been completely exploited, as many structural genomics centers have only recently entered their second phase of operation.

Small Genomes and No Automation

How can a small laboratory group with a limited budget do structural genomics and have an impact on structural biology? The real limiting factors for a small team doing structural genomics are cost and feasibility. If costs are reduced and feasibility is increased, a team of as few as two people working on a small genome can efficiently clone, express, crystallize, and determine three-dimensional structures of proteins for less than $200 000 within 2 years (Table 2). We highlight key factors that will affect feasibility and cost while providing examples from our own structural genomics activities.

1. Genome Size and Impact. One important factor to consider when starting a modest structural genomics project is the genome size. The maximum number of coding regions that one can feasibly analyze and select for cloning without any automation is about 1500 for two people working efficiently. We have derived this number from empirical findings from our laboratory and from the experiences of others. One pair of competent hands can comfortably manipulate and follow through all the necessary procedures required from PCR cloning to small-scale protein production using initially eight 96-well plates of cloning reaction mixtures. In this respect, 1500 coding targets can be initially prepared by two people to perform all subsequent

steps for protein expression screening without detrimental errors. The targets described here are assumed to be all soluble proteins derived from approximately 75% of the coding regions of the majority of genomes. Therefore, a reasonable size genome that would account for screening a maximum of 1500 open reading frames would be about 2 million nucleotides or less. The impact of structural genomics on small genomes would be significant and, compared to large eukaryotic genomes, the diversity is abundant. If we consider the smallest genomes, from those of viruses to those of prokaryotes, the organisms that contain these genomes represent the most abundant biological entities and biomass on Earth.51 Small genomes to date have already provided an immense amount of information in terms of revealing new folds related to novel structure–function relationships, including those of human pathogens and viruses.52,53 Minimalistic genomes from bacteria and archaea have contributed significantly toward filling in protein fold-space,54,55 understanding protein thermal stability,49,56 and providing insight into the structural evolution of proteins.57 Whole genome comparison of bacterial mutants coupled to structural genomics has provided a structural basis for the understanding of antibiotic resistance,58 spore formation, and virulence,59 which are important in terms of human health, bioterrorism defense, and agriculture.

Table 2. Costs of Materials and Equipment for SG Mini-Pipeline*

Stage of pipeline: genomic DNA preparation/target gene acquisition
Equipment: standard laboratory equipment
Materials: DNA extraction kit
Persons (time required): 1 individual with moderate molecular biology experience (1–2 weeks)
Material costs: genomic DNA preparation, $1000–$1500 (cost to sequence small genome if sequence is not available, $25 000)

Stage of pipeline: genome assembly and annotation/target validation
Equipment: genome assembly and annotation software
Materials: no laboratory materials needed
Persons (time required): 1 individual familiar with genome assembly who can use the software (1–2 months)
Material costs: software license fees, $0–$2500

Stage of pipeline: cloning of targets
Equipment: standard laboratory equipment; must include 96-well cycler and multichannel pipettes
Materials: ligation-free cloning kits, vectors, plasmid mini-prep kits, standard molecular biology supplies, PCR primers, and 12-lane cloning grilles
Persons (time required): 1–2 individuals with moderate experience (2–6 months)
Material costs: oligomer synthesis, $10 000–$20 000; cloning materials, $15 000–$25 000

Stage of pipeline: small-scale expression trials
Equipment: standard laboratory equipment, multichannel pipettes
Materials: standard laboratory supplies, growth media, antibiotics, induction reagents, and 96-well blocks
Persons (time required): 2 individuals with moderate experience (6–12 months)
Material costs: media and reagents, $5 000–$10 000; plasticware, $2 000–$5 000

Stage of pipeline: solubility/expression screening
Equipment: standard laboratory equipment plus affinity purification equipment, and electrophoresis system designed to load with multichannel pipettes
Materials: standard laboratory supplies, affinity purification materials, gels for multichannel loading, and some additional plastic ware
Persons (time required): 1–2 individuals with moderate experience (6–12 months)
Material costs: affinity purification materials, $10 000–$15 000; other materials, $5 000–$10 000

Stage of pipeline: large-scale protein production and purification
Equipment: standard equipment; must include modern FPLC plus all equipment needed for large-scale expression, 2 large multiplace shaker/incubators, and efficient equipment for cell lysis
Materials: standard materials and reagents
Persons (time required): 2–3 individuals producing 8–12 proteins a week (life of project)
Material costs: $20 000–$30 000

Stage of pipeline: crystallization trials
Equipment: standard equipment, low-temperature incubator, appropriate microscope, multichannel pipettes
Materials: standard materials plus 96-well sitting drop plates, concentrators, commercial screening kits, additives
Persons (time required): 1 person to concentrate and set up 8–12 proteins for crystallization a week and track results daily (life of project)
Material costs: $10 000–$20 000

Stage of pipeline: data collection and processing
Equipment: home X-ray source or access to a beamline, software pipeline, dewars, crystal retrieval and mounting equipment, cryo-tools and equipment
Materials: cryo materials, loops and reagents, heavy atom derivatives, etc.
Persons (time required): 1 individual who can mount crystals and collect and process data, including preparation of HA derivatives, etc. (life of project)
Material costs: $35 000–$50 000 annually

Stage of pipeline: structure determination and refinement
Equipment: software, access to automated structure determination pipeline
Materials: no laboratory materials needed
Persons (time required): 1 person with the ability to efficiently solve and refine protein structures (life of project)
Material costs: software license fees, $2500–$10 000

* Estimates based on small genome, about 2 Mbp and/or 96–576 targets for a period of 1–2 years.
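The material-cost ranges in Table 2 can be tallied to check them against the "less than $200 000" figure. The following is a minimal sketch, not part of the original pipeline: the item labels are shortened by hand, the optional $25 000 genome-sequencing cost is left out, and the function name `total_cost_range` is hypothetical. How many years of the "annually" quoted data-collection cost to count is an assumption exposed as a parameter.

```python
# Material-cost ranges (USD) transcribed from Table 2; labels are shortened.
# The optional $25 000 genome-sequencing cost is deliberately excluded.
ONE_TIME_COSTS = {
    "genomic DNA preparation": (1_000, 1_500),
    "annotation software licenses": (0, 2_500),
    "oligomer synthesis": (10_000, 20_000),
    "cloning materials": (15_000, 25_000),
    "expression media and reagents": (5_000, 10_000),
    "expression plasticware": (2_000, 5_000),
    "affinity purification materials": (10_000, 15_000),
    "other screening materials": (5_000, 10_000),
    "large-scale production": (20_000, 30_000),
    "crystallization trials": (10_000, 20_000),
    "structure determination software licenses": (2_500, 10_000),
}

# The only item the table quotes per year.
ANNUAL_COSTS = {"data collection and processing": (35_000, 50_000)}


def total_cost_range(years=1):
    """Return (low, high) total material cost for a project of `years` years.

    Assumption: one-time items are paid once; annual items scale with years.
    """
    low = sum(lo for lo, _ in ONE_TIME_COSTS.values())
    high = sum(hi for _, hi in ONE_TIME_COSTS.values())
    for lo, hi in ANNUAL_COSTS.values():
        low += lo * years
        high += hi * years
    return low, high


if __name__ == "__main__":
    for years in (1, 2):
        low, high = total_cost_range(years)
        print(f"{years} year(s): ${low:,} to ${high:,}")
```

Under these assumptions the upper bound stays below $200 000 when data collection is counted for one year; counting it for two years pushes the upper bound above that figure while the lower bound remains well under it.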

2. Sequence Data Management and Analysis. Using a single PC of reasonable speed and adequate memory space, two good size monitors (19–20 in.) and an internet connection, a single person with minimum programming skills can annotate, organize, and track small genome sequences. Public databases contributed from large genomics centers have archived biological data and made its information available worldwide to be interfaced with GUI-based software packages as described earlier. Although licensing fees are often required, more packages are being made open-source and web-based. We have used BASys, a web server for automated bacterial genome annotation,60 structural genomics target selection suites,61 and PCR primer design programs62 to analyze a 2 million nucleotide genome. Recently, complete user-friendly workflow platforms have been made available open-source, which offer a comprehensive data management system to store everything from target selection, primer design, and cloning to protein expression and crystallization trials.63 These program packages can be easily installed on most desktop (or even a laptop) computers, and sequence data can be managed at subsequent stages of the cloning and expression pipeline by the same individuals performing the technical tasks.
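As an illustration of the kind of bookkeeping such a data management system performs, the sketch below tracks each target through the pipeline stages named above. It is a hypothetical, minimal stand-in and not the interface of any of the cited packages; the class names, stage names, and ORF identifiers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Pipeline stages in the order a target passes through them."""
    SELECTED = 1
    PRIMERS_DESIGNED = 2
    CLONED = 3
    EXPRESSED = 4
    PURIFIED = 5
    CRYSTALLIZED = 6
    STRUCTURE_SOLVED = 7


@dataclass
class Target:
    """One open reading frame and its progress through the pipeline."""
    orf_id: str
    stage: Stage = Stage.SELECTED
    notes: list = field(default_factory=list)

    def advance(self, new_stage, note=""):
        # Targets only move forward; a failed target simply stays where it is.
        if new_stage.value <= self.stage.value:
            raise ValueError(f"{self.orf_id} is already at {self.stage.name}")
        self.stage = new_stage
        if note:
            self.notes.append(note)


def count_by_stage(targets):
    """Return a dict mapping each Stage to the number of targets at it."""
    counts = {stage: 0 for stage in Stage}
    for target in targets:
        counts[target.stage] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical ORF identifiers, one 96-well plate's worth.
    plate = [Target(f"ORF_{i:04d}") for i in range(1, 97)]
    plate[0].advance(Stage.CLONED, "PCR product confirmed on agarose gel")
    for stage, n in count_by_stage(plate).items():
        print(f"{stage.name:<18}{n}")
```

A flat structure like this is enough for the same two people doing the bench work to see, plate by plate, how many targets remain at each stage.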


In our particular case, we have sequenced and annotated the entire genome of a hyperthermophilic archaeon, Thermococcus thioreducens (T. thioreducens)64 containing 2 064 821 bp. We have conducted bioinformatics analysis of the T. thioreducens genome using the Softberry program suites (www.softberry.com) to reveal a variety of biochemical properties and predictions. Predictions of 35 parameters have been correlated with the protein expression results from 2301 ORFs. The most prominent parameters were found to be hydrophobicity related (e.g., signal peptide and transmembrane helices). Overall hydrophobicity is one of the most important factors determining whether an ORF yields a soluble expression product: low hydrophobicity favors soluble expression, whereas signal peptides and transmembrane domains have a detrimental effect. These preliminary results provide significant experimental validation for using hydrophobicity in bioinformatics analysis. In the cloning and expression of C. elegans proteins by the SECSG consortium, it was demonstrated that of 87 genes cloned, expression was not observed for the 19 genes with the highest hydrophobic values. The more negative the hydrophobic value becomes, the more likely it is that an ORF will exhibit protein expression. Although protein expression was observed for the majority of the genes with low hydrophobicity measures, not all were soluble. Soluble expression depends on other factors and cannot be accurately predicted by bioinformatics methods alone. Thus, in practice, empirical screening appears to be the only reliable way to identify the ORFs that can be expressed in a soluble form. The rare codon issue is relevant to archaeal protein expression in E. coli, which lacks archaea-specific tRNAs for some of the amino acids, such as Arg, Ile, Leu, Gly, and Pro. In our case, the data for 2301 ORFs revealed rare codons that would subsequently affect protein expression.
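Both per-ORF quantities discussed here are simple to compute. The sketch below is illustrative only: the text does not say which hydrophobicity scale the Softberry tools report, so the Kyte-Doolittle GRAVY score (mean hydropathy per residue) is swapped in as a common stand-in, and the rare-codon set is an assumption based on the six codons commonly cited as rare in E. coli for the five amino acids named above. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumed set of codons rare in E. coli (DNA alphabet):
# Arg AGA/AGG, Ile ATA, Leu CTA, Gly GGA, Pro CCC.
RARE_CODONS = {"AGA", "AGG", "ATA", "CTA", "GGA", "CCC"}


def gravy(protein):
    """Mean Kyte-Doolittle hydropathy; more negative means more hydrophilic."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty protein sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


def count_rare_codons(cds):
    """Count in-frame rare codons in a coding DNA sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return sum(1 for i in range(0, usable, 3) if cds[i:i + 3] in RARE_CODONS)


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    print(round(gravy("MKRDEEK"), 2))
    print(count_rare_codons("ATGAGAAGGATACTAGGACCC"))
```

Sorting ORFs by ascending `gravy` would put the most hydrophilic candidates first, in line with the trend described above, and a high `count_rare_codons` value across many ORFs is the rare codon problem just noted for expression in E. coli.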
This points toward the use of codon-plus strains, such as Rosetta competent cells (Novagen, Madison, WI), to improve protein expression levels.

3. Manual Mode over Automation. There is no doubt that robotics and automation have expedited the structural genomics pipeline. High-throughput preparation at the level of cloning and crystallization setup places the highest demand on repetition and precision. When the number of ORF targets to be cloned is 2000 or fewer, it is very feasible to perform all necessary steps manually using multichannel pipettes. Historically, early structural genomics projects did not have any automated systems available but instead relied on a large number of skilled undergraduate and graduate students.65,66 In the absence of automation, cost is reduced so much that an effective structural genomics pipeline can be run on a modest budget. The estimated cost and materials required for a small laboratory with standard laboratory equipment to do structural genomics are outlined in Table 2. Provided with a small genome (approximately 2 million bp or less), a laboratory can go from gene to structure within 2 years for less than $200 000. The process would require at least 2 people with standard laboratory equipment, with the following technical considerations on common rate-limiting steps.

4. General Cloning and Expression Strategy. PCR amplification and subsequent manipulation are all performed in 96-well format. Oligonucleotide primer pairs directed against targeted open reading frames are prepared in 96-well stock trays usually provided by the oligonucleotide synthesizing companies. One of the most time-consuming steps in a structural genomics pipeline is integrating the amplified product into proper expression vectors by restriction digest and ligation reactions followed by transformation into its appropriate propagating host. We have


found directional cloning methods that use in vivo homologous recombination to be the most efficient and cost effective because cloning can be accomplished without purchasing special vectors or expensive enzymes. Commercial kits are currently available for this type of cloning (e.g., Xi clone (Genlantis, San Diego, CA)). The whole process can be accomplished rapidly using multichannel pipettes to dispense multiple reactions and bulk transformation in competent cells. Transformed cells with antibiotic resistance markers can be plated on 12-lane reservoirs containing LB agar. Use of 12-lane reservoirs allows for the selection of positive transformants while maintaining the geometry needed for multichannel pipetting. In our laboratory, the extreme 5′ and 3′ DNA ends flanking the open reading frames have been designed to contain a restriction recognition site that overlaps with that of the expression vector. The flanking sequences are thus homologous to the end sequences of a prepared linear expression vector. Topoisomerase III in association with Klenow DNA polymerase is used to provide intrinsic strand displacement and minor exonuclease activity such that when the ends of two linear DNA fragments have homologous sequences, they are paired. Consequently, the DNA products are converted into circular plasmids when transformed into bacteria. We have previously used pET vectors (Novagen, Madison, WI) for successful expression of hyperthermophilic enzymes. No tagging was necessary because a heat selection step was included as part of the initial purification procedure, as will be described below. When the synthesized DNA fragment and the prepared linear expression vector are mixed and transformed into competent E. coli cells (e.g., DH5α), an intrinsic recombinase activity joins the two DNA fragments, resulting in a circular expression plasmid containing the coding region of the targeted protein.
Thus, cloning can be performed without the use of overnight ligation or restriction enzymes.

5. Specific Example of Protein Expression and Solubility Profiling. We have used a hyperthermophilic archaeal genome to demonstrate the efficiency of the cloning and expression procedure. In this particular case, the recombinant proteins were initially screened for heat resistance and solubility. All the manipulations described here were executed with multichannel stepper micropipettes. In 0.2 mL 96-well plates, the linear synthesized DNA fragment and expression vector were placed in each well and competent E. coli cells were mixed in for direct DNA uptake. Transformed cells were grown directly in each well with 0.1 mL of culture medium for 24–48 h to propagate the recombinant plasmid. Thereafter, the expression plasmids were purified by standard plasmid mini-preparation procedures. The purified plasmid was further transformed into an expression host and plated in 12-lane reservoirs on LB-agar media in the presence of antibiotics. Individual colonies were then selected and used for small-scale expression trials with IPTG induction. The protein-expressing cells were disrupted using lysozyme and centrifuged with a 96-well plate holder rotor. The resulting supernatant was heated at 75 °C for 15 min, and any precipitate formed was removed by further centrifugation. Targeted proteins were usually heat stable and, if folded properly, were not likely to precipitate during the heat purification step. The remaining supernatant was analyzed by SDS-PAGE to validate the degree of protein overexpression and the molecular weight. It is advantageous to select a genome that expresses proteins of outstanding biophysical properties. In our case, recombinant proteins were heat resistant and therefore their intrinsic biochemical characteristics were exploited. It is also feasible at


this stage to perform other types of biophysical selection such as pH or salinity. If affinity tags are used in the cloning procedures, then special pipette tips containing affinity resins can also be utilized for small-scale purification (e.g., Ni2+ Metal Chelating Zip Tips (Millipore, Billerica, MA)). We have also found the E-page electrophoresis system (Invitrogen, Carlsbad, CA) to be particularly useful in analyzing a large number of samples by SDS-PAGE to determine recombinant overexpression. Of the ORFs cloned for the expression of soluble proteins, about 33% showed significant overexpression of soluble and heat-resistant proteins. This percentage was comparable to the average cloning and expression success rate of large-scale structural genomics centers.

6. Expression Level. The level of expression for each protein in a small structural genomics project is evaluated in a similar way to that of the large centers, by categorizing protein expression levels into three separate groups. The first group includes those proteins that are expressed at high levels and where the resulting recombinant protein represents more than 20% of the soluble protein. The second group comprises those proteins that are expressed in low quantities (less than 20% of the soluble protein) but that can be purified in significant amounts (5 mg or more per liter of culture). The third group includes those proteins that are observed to be spontaneously degraded, precipitated, or highly aggregated. These characteristics can be determined by SDS-PAGE and light-scattering analysis. The data are recorded in terms of the number of targets and the percentage of targets that are placed in each category. Within the scope of our mini-structural genomics pipeline, only the proteins belonging to the first group are pursued.
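The three-group scheme above amounts to a small decision rule, sketched below for illustration. The function names and inputs are hypothetical, and two points are assumptions because the text leaves them open: a protein at exactly 20% of the soluble fraction is placed with the low expressers, and a low expresser yielding less than 5 mg per liter is returned as unclassified (`None`).

```python
from collections import Counter


def classify_expression(percent_of_soluble, yield_mg_per_l, unstable=False):
    """Assign a target to expression group 1, 2, or 3, or None.

    percent_of_soluble: recombinant protein as % of soluble protein (SDS-PAGE).
    yield_mg_per_l: purified protein obtained per liter of culture.
    unstable: True if degradation, precipitation, or aggregation is observed.
    """
    if unstable:
        return 3
    if percent_of_soluble > 20:
        return 1
    if yield_mg_per_l >= 5:
        return 2
    return None  # assumption: this case is not covered by the three groups


def summarize(groups):
    """Return {group: (count, percent of all targets)} for recorded groups."""
    counts = Counter(groups)
    total = len(groups)
    return {g: (n, 100.0 * n / total) for g, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical measurements: (percent of soluble, mg/L, unstable?).
    measurements = [(35, 12, False), (10, 6, False), (50, 20, True), (8, 1, False)]
    groups = [classify_expression(*m) for m in measurements]
    print(groups)
    print(summarize(groups))
```

In a mini-pipeline that pursues only the first group, the targets returning 1 would be the ones carried forward to large-scale production.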
Because soluble protein expression levels often diminish over time if the transformed host cells are stored frozen, the expression trials are usually conducted after fresh transformation of plasmid into competent E. coli. Induction of recombinant expression at 18 °C should also be tried, as previous proteomic studies have shown that a slower protein production rate facilitates soluble expression in E. coli, perhaps by slowing the kinetics of folding and assembly. The solubility properties of recombinant proteins sometimes change during the course of purification. This problem is often observed when His-tagged proteins expressed in eukaryotic systems become more insoluble when subjected to purification by nickel-affinity, ion-exchange, or gel-column chromatography. Furthermore, these proteins also have a propensity to either aggregate or precipitate out of solution in an amorphous manner when purified at a nonoptimal pH or in the absence of detergents. The recombinant proteins that fall into this category should be tested for compatibility against a panel of buffers with different pH values, detergents, and reducing agents. If favorable conditions are found to maintain or enhance protein solubility, the recombinant product should be re-expressed and purified under the new conditions. Detergents such as sarcosyl have been shown to work particularly well in mild concentrations (between 0.05 and 1%) to improve the behavior of solubility-challenged proteins. We have also observed in the past that some proteins may be soluble only as a complex with another (macro)molecule, which is lost, along with the solubility, during the purification process. There have been accounts recently (unpublished) that overpurifying a protein may in some cases remove important contaminating proteins or peptides, rendering the targeted macromolecules less soluble. This concept may be contrary to the common belief that a homogeneous solution is


the most desirable. In particular, proteins that are very flexible or have nonrigid domains may have more favorable solubility in a heterogeneous solution than in one that is ultrapure. Similarly, proteins with known binding partners (interactions with other cellular proteins) may be easier to purify as a complex and consequently more facile to manipulate in terms of subsequent crystallization. Therefore, solubility behavior as a function of protein–protein interactions should also be considered in terms of promoting favorable intermolecular contact points.

7. Large Scale Production. Proteins that show favorable small-scale expression are selected for large-scale production to produce 10 mg or more of protein in 1–2 L of culture growth. These are usually proteins that are placed in the first category. The complete cloning and small-scale expression screening can be completed within 16 weeks (96 ORFs per week) for 1500 coding regions. If protein solubility rescue is not included, then within 4 months, the most promising targets for large-scale expression are known. More than 90% of the time in prokaryotic or archaeal systems, the large-scale expression is similar to that of small scale in terms of yield and efficacy. Purification usually involves a primary selection step by heat, salt, or pH (or His affinity tag) followed by ion exchange and gel filtration. Cell growth on the large scale can be accomplished in disposable 2 L beverage bottles67 or in standard glassware. The number of large-scale preparations that were required in our structural genomics project was 475 (31.6% of all cloned ORFs). Two people with 3 chromatography systems can comfortably purify 12 proteins a week. Each purification batch included 10–50 mg of protein that was more than 90% homogeneous as determined by SDS-PAGE. In our pipeline, two chromatographic steps are routinely executed after an initial heat selection.
This includes separation by ion-exchange and size-exclusion chromatography. Proteins are usually screened immediately for crystallization. Those proteins that do not eventually crystallize usually undergo a protein optimization or salvaging protocol. In the interest of speed, however, low-throughput manipulations are not usually carried out unless the protein target is of particularly high impact. 8. Protein Crystallization. There are a number of techniques to bring a protein solution to a supersaturated state, and the most commonly used are the batch and vapor diffusion methods (for review, see the literature68,69). Initial crystallization conditions have been successfully identified with “sparse-matrix” or “incomplete-factorial” screens28 that evaluate a range of pH and precipitant combinations. Recently, there has been a strong point of view that additives play more of a key role than the type of precipitants in successful protein crystallization. Thereby, screening different molecules that may contribute to key intermolecular interactions would be an effective screening strategy.69 Once initial conditions are found, we optimize crystallization by varying fine concentrations of biochemical parameters around the original crystal growth conditions by vapor diffusion or counter-diffusion methods 35,36 for in situ X-ray diffraction screening. One of our common preferred methods of finding initial crystallization conditions is using only one type of desiccant solution for a sitting droplet vapor diffusion screen against sparse-matrix precipitants.70 The Intelli-plate (Art Robbins, Sunnyvale, CA) is an excellent 96-array crystallography plate to be used for this purpose. The plates are visually transparent and birefringent free and have labeled wells for ease of


inspection and an appropriate geometry for easily looping crystals with standard equipment. A single person can set up 96–288 protein sitting droplets within 15 min.71 The common reservoir solution in this case may be 40% PEG 6000, where the protein droplets can have different volume ratios of protein to precipitating reagents such as 1:2, 1:1, and 2:1. One tray setup can contain two standard commercialized screens (96 conditions) prefilled in 1.5 mL blocks. The best drop volume for each well is 2 µL, so that one array entails 100 µL of PEG reservoir solution with protein and precipitant volumes of 0.67 µL + 1.33 µL, 1 µL + 1 µL, and 1.33 µL + 0.67 µL. In some cases, covalently labeling the protein with trace fluorescent probes may help distinguish protein crystals from non-protein crystalline or amorphous matter. Easy protocols are now available for quick fluorescence tagging that does not interfere with the crystallization process72,73 and proves to be advantageous during the subsequent visualization.

9. X-ray Data Collection. There are two crystallographic bottlenecks in structural genomics. The first is the ability to nucleate and grow a protein crystal suitable for X-ray diffraction at a resolution of 3 Å or better. When diffraction data are available, the second rate-determining step is obtaining accurate phases coupled with the diffraction intensities to calculate an interpretable electron density map. Although most diffraction data are collected on synchrotron beamlines, the home laboratory X-ray source is still useful for structure determination even though it is used more for diffraction prescreening. One of the most effective techniques for passing through the bottleneck steps is the counterdiffusion crystallization procedure using glass or nonbirefringent plastics in a restricted cylindrical geometry with diameters of 0.2 mm or less and lengths as short as 2 cm.
Details of the principles and procedures are described in other studies.35,36 During the diffusion process, a supersaturation state in respect to the protein can be attained and crystal formation is caused by the progression of a nucleation front resulting from the nonlinear interplay among mass transport, protein crystal nucleation, and growth.35 The biggest advantage is its ability to perform in situ X-ray data analysis, where crystals along the restricted geometry can be evaluated directly by X-ray diffraction. Consequently, complete data sets can be collected for the calculation of electron density maps without ever handling the crystals.34 Heavy atom derivatives and cryogenic reagents can be easily incorporated during the crystallization process without invasive manipulation. Moreover, many high X-ray scattering atoms such as iodine can provide anomalous signals for ab initio phasing at the home X-ray wavelength of 1.54 Å. In the case where a chromium X-ray source is available, anomalous sulfur phasing can be implemented without additional treatment of heavy atoms.74–76 Many thermophilic proteins from our structural genomics effort with SECSG were crystallized and their structures determined by this method. 10. Structure Determination. Small structural genomics groups do not necessarily need a dedicated synchrotron light source. Special beamline fees and travel cost to the synchrotron can thereby be reduced dramatically compared to conventional structural genomics data collection when limiting synchrotron data collection through the general application processes. X-ray generators and detectors such as those from MSC-Rigaku, MAR, and Bruker are able to record diffraction spots with high sensitivity and low background noise. Redundant data sets collected with these systems are now quite adequate to perform


molecular replacement and multiple isomorphous replacement and even to distinguish anomalous signals with a fixed-wavelength home source for de novo phasing from certain heavy atom soaks. We would like to emphasize that cryogenic data collection is not mandatory. As mentioned previously, many crystals never make it to structure because of the destructive effects of physical handling and temperature shock. In this respect, there is much value in older crystallographic approaches to X-ray data collection that involved no extravagant looping and cryofreezing. In fact, most crystallographers are reluctant to admit that a significant number of protein structures can be determined using only room temperature data. When indexed and complete processed data set(s) are available, crystallographic computational pipelines are now used as described earlier in this review for data evaluation and structure determination. Whether one is performing molecular replacement or ab initio phasing, the robust crystallographic packages used in large structural genomics centers can easily be used by small teams as well. It is worthwhile to remind the reader that the objective of structural genomics is to determine 3D structures of proteins for data deposition, contributing to an unabridged dictionary of structural folds associated with functional families. The mentality is to pursue what works and save the rest for later. Therefore, a small team’s approach is similar, where only the proteins that overexpress, are soluble, crystallize, and provide usable X-ray data for structure determination are of interest. Overall, using our example genome, our rate of success from gene to structure has thus far been 12%. This rate is comparable to the average success values of large structural genomics centers, which have been reporting 10–15%. Figure 1 exemplifies some of the major steps in our pipeline.
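The throughput figures quoted across the preceding steps can be combined into a rough timeline. The sketch below is back-of-the-envelope arithmetic, not a model used by the authors: the function name is hypothetical, the stages are treated as strictly sequential, and the 12% success rate is assumed to apply to all 1500 targeted ORFs.

```python
import math


def pipeline_estimate(n_orfs=1500, orfs_per_week=96,
                      large_scale_preps=475, preps_per_week=12,
                      structure_rate=0.12):
    """Estimate screening time, production time, and expected structures.

    Defaults are the figures quoted in the text. Weeks are rounded up
    because a partly filled week still has to be worked.
    """
    return {
        "screening_weeks": math.ceil(n_orfs / orfs_per_week),
        "production_weeks": math.ceil(large_scale_preps / preps_per_week),
        "expected_structures": round(n_orfs * structure_rate),
    }


if __name__ == "__main__":
    for name, value in pipeline_estimate().items():
        print(f"{name}: {value}")
```

With the default figures the screening and large-scale production stages together come to a little over a year of bench time, which is at least consistent with the two-year horizon stated for the mini-pipeline.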
Structural Genomics for All

Yes, small laboratories can do structural genomics! Segelke et al.66 have already reported the feasibility of pursuing structural genomics on a single laboratory scale with 3–5 full-time employees and some robotics. Here, we point out that a team of as few as 2 people with conventional laboratory supplies and access to a home X-ray source can reasonably perform bioinformatics, recombinant cloning and expression, crystallization, and X-ray crystallography on a genome containing about 1500 coding regions for less than $200 000. The technological advances and experience provided by large structural genomics centers have made it possible for biochemists and biologists to exploit the techniques of structural biology with only a broad general knowledge of the specific techniques involved. This approach is not applicable to insoluble proteins or to targets that require involved salvaging procedures. It is feasible for small groups to analyze a small genome and obtain a similar percentage of success as large structural genomics groups for low-hanging fruit proteins. Both large centers and small groups will face unique challenges at each stage of the pipeline. Some of these challenges are compared directly with one another in Table 3. Presently, methods of genomic sequencing have advanced tremendously (e.g., 454 Life Sciences and Solexa-Illumina technologies) such that entire genomes can be deciphered very quickly and at a reasonable cost (as little as $25 000 for genomes of less than 5 million nucleotides). Consequently, new genome sequences are now obtained almost on a daily basis. This has provided, to some extent, a structural genomics free-for-all in which everyone’s efforts, including those of the small


Figure 1. Representative steps of a mini-structural genomics project involving a hyperthermophilic genome and two people. (A) Cloning is performed by PCR amplification using selected oligonucleotides against targeted coding regions in a 96-well format. (B) PCR amplification is usually very specific and successful reactions are observed to occur more than 90% of the time, as shown when reaction samples are analyzed by agarose gel electrophoresis. The samples shown are run on 1% agarose, stained with ethidium bromide, and visualized under UV. (C) The PCR products are subcloned directly into an expression vector via homologous annealing and transformed into DH5α-competent E. coli cells without ligation or extensive restriction digest. The resulting plasmids are propagated and transformed into an expression host for small-scale expression trials. (D) Different clones are overexpressed and the fractions are heat-selected such that the remaining soluble protein fractions are analyzed by SDS-PAGE (examples shown are in triplicate). Overexpression is clearly seen as the heaviest band and usually comprises more than 50% of the total cellular protein at this step. Approximately one-third of the cloned fragments demonstrated strong expression and proceeded to large-scale expression. Subsequently, they are purified to more than 90% homogeneity using ion-exchange or size-exclusion chromatography. (E) Purified protein fractions are screened for crystallization. For protein crystals that diffract to atomic resolution, a complete data set is collected and used for crystallographic structure determination.

laboratories, will be necessary to decipher the structure and function of these genes. Small teams can focus on proteins that share a particular theme or function instead of pursuing structures of random open reading frames. Specific proteins, e.g., those related to pathways of metabolism or replication, may be excellent targets for a small structural genomics pipeline. Even proteins from viral genomes are of considerable value and are easy for small groups to explore. We have intentionally not discussed membrane proteins because high-throughput membrane protein production and crystallization still require significant effort and expense and are not yet suitable for small structural genomics teams as described here. When all readily crystallizable macromolecules with known annotated functions have been exploited, the future of target selection for structural genomics mini-pipelines should include hypothetical proteins and protein complexes. Hypothetical proteins are those gene products that have no identifiable function and often account for a significant fraction of the genome’s sequence. In many genomes, hypothetical gene sequences comprise 10–30% of recognized open reading frames. In our case, the genome of T. thioreducens has more than 300 open reading frames that correspond to unknown functions. The unknown proteins may be orthologous in function to proteins of other related or unrelated organisms or they may simply

represent gene products that are distinctive to a specific life form. The value of analyzing hypothetical proteins lies in discovering new protein folds associated with novel functions as well as in studying their evolutionary significance in terms of finding ancient domains or molecular relics. Protein complexes are the wave of the future in structural genomics. A protein complex is a group of two or more associated proteins stabilized by favorable intermolecular interactions. Complexes also include proteins that are associated with nucleic acids such as DNA or RNA. The future of structural genomics is to understand biological processes such as signal transduction, cell-fate regulation, transcription, and translation, which involve the interplay of hundreds of macromolecules. Recently, techniques have been developed on a proteome-wide scale to study physical binding between proteins. These include the yeast two-hybrid assay (Y2H),77,78 which assesses whether two individual gene products interact, and affinity purification coupled with mass spectrometry (APMS),79 which is an effective approach to identifying protein complexes containing more than two components. Although the latter has been used by many large centers with microarray technology, the most practical and feasible technique for small groups is the former. Using the Y2H system, hundreds of recombinant transformants (with each transformant expressing one of the open reading frames as a fusion to an


Table 3. Large Centers and Small Teams Face Unique Challenges at Each Stage of the Pipeline

Stage of pipeline: bioinformatics
SG center: Funding supports a full bioinformatics team. Challenges exist in terms of effective communication between scientists and the bioinformatics team. Data must be managed and processed in a manner that is useful to the scientists.
Small laboratory: Funding may not support a full bioinformatics team. Strong need exists for open-source, user-friendly bioinformatics tools and data-management software.

Stage of pipeline: cloning
SG center: Cloning must be accomplished with extreme speed and efficiency. Expensive robotic machines are needed to automate the process.
Small laboratory: HTP cloning methods must be adapted to manual methods. Cloning efficiency is important, but speed is not as critical because of the protein production bottleneck.

Stage of pipeline: protein production/purification
SG center: Proteins must be purified rapidly without jeopardizing the success of future crystallization trials. Automation of this process is a challenge.
Small laboratory: Proteins must be produced and purified using only the available equipment.

Stage of pipeline: crystallization screening
SG center: Robotics are needed along with automated imaging and storage systems. Challenges exist in balancing volume reduction and the production of crystals suitable for X-ray diffraction.
Small laboratory: Small-volume solution-handling equipment usually is not available. Therefore, larger volumes must be used, and crystallization trials must be set up and routinely examined manually.

Stage of pipeline: data collection
SG center: Automated crystal screening robots are needed. Equipment for automated crystal mounting is still lacking, and in situ methods have not been fully implemented.
Small laboratory: Access to synchrotron radiation may not be available or may be too expensive. The home X-ray source must be used to its fullest. This produces challenges in terms of the time needed for data collection. Heavy-atom (HA) derivatives and sulfur phasing methods must be employed.

Stage of pipeline: structure determination/refinement
SG center: MAD phasing using Se-Met-labeled protein can be routinely used to acquire phase information. Challenges exist in increasing the throughput of structure determination and simplifying the time-consuming iterative refinement process.
Small laboratory: Phasing methods make structure determination difficult and time consuming. Resolution limits of home sources and the requirement for increased exposure times produce challenges that carry over to structure determination and refinement.

activation domain) can be screened with no automation to identify protein–protein interactions between full-length open reading frames predicted from an annotated genome sequence. Genome-wide two-hybrid approaches have revealed novel interactions between proteins involved in related biological functions.78 The details of the Y2H technique will not be discussed here. In simple (non-eukaryotic) organisms, the isolation, preparation, and analysis of macromolecular complexes using Y2H systems can still be executed with limited personnel and a modest budget.

Conclusion

Large structural genomics centers have been costly and arguably overpriced for the benefits returned. However, the contributions in technology advancement and protein structural information inspired by or originating directly from structural genomics centers have been many and perhaps underappreciated. Among the important outcomes of structural genomics are robust computational tools that have given rise to powerful bioinformatics and crystallographic pipelines. Unique and efficient cloning systems have been developed to couple with new automation, allowing high-volume cloning and efficient recombinant expression of targeted proteins. Finally, methodologies in protein crystallization and data collection have been devised to accelerate the search for and optimization of crystallization conditions that yield protein crystals suitable for X-ray diffraction. The ultimate gain has been speed: the path from gene to structure has never been so fast in the history of structural biology. The cost per structure deposition has also been reduced compared to traditional methods of protein structure determination. Structural genomics is not limited to large, heavily funded centers. Small laboratories can also participate and have an impact on the field of structural biology. A large number of genome sequences from diverse organisms are becoming more accessible, especially those from microorganisms and viruses. Exploiting discoveries and utilizing technology from large structural genomics centers permits a team of as few as two people working on a small genome to practice bioinformatics, cloning and recombinant expression, protein crystallization, and crystallographic structure determination without automation.

Acknowledgment. Many of the ideas reviewed here were derived from our experiences with the SECSG structural genomics program with the laboratory of Bi-Cheng Wang and from opinions of collaborators from other structural genomics centers. The development of a structural genomics mini-pipeline was supported in part by NSF STTR-05605 and NSF-EPSCoR (EPS-0447675). We are grateful to Jim Hudson for providing us the opportunity to obtain the complete genome sequence of T. thioreducens and to Edward Meehan for his technical discussions. Our gratitude extends to Miranda Byrne, Ernie Curto, Owen Garriott, Trent Clinton Go, Damien Marsic, Barbara Pusey, Kimberly Seaman, and Brandon Steele for their contributions to this work. We thank Dawei Lin and Zheng-Qing Fu for their very helpful discussions in bioinformatics for structural genomics. Finally, we appreciate the valuable comments provided by Lynn Boyd and Maria Davis.

References

(1) Chandonia, J.-M.; Brenner, S. E. The impact of structural genomics: Expectations and Outcomes. Science 2006, 311, 347–351.
(2) Doyle, S. A. High-throughput cloning for proteomics research. Methods Mol. Biol. 2005, 310, 107–13.
(3) Dieckman, L.; Gu, M.; Stols, L.; Donnelly, M. I.; Collart, F. R. High throughput methods for gene cloning and expression. Protein Expression Purif. 2002, 25 (1), 1–7.
(4) Yokoyama, S. Protein expression systems for structural genomics and proteomics. Curr. Opin. Chem. Biol. 2003, 7 (1), 39–43.
(5) McPherson, A. Protein crystallization in the structural genomics era. J. Struct. Funct. Genomics 2004, 5 (1–2), 3–12.
(6) Yee, A.; Gutmanas, A.; Arrowsmith, C. H. Solution NMR in structural genomics. Curr. Opin. Struct. Biol. 2006, 16 (5), 611–7.

(7) Dauter, Z. Current state and prospects of macromolecular crystallography. Acta Crystallogr., Sect. D 2006, 62 (Pt 1), 1–11.
(8) Wu, C. H.; Huang, H.; Yeh, L.-S. L.; Barker, W. C. Protein family classification and functional annotation. Comput. Biol. Chem. 2003, 27, 37–47.
(9) Wilson, D.; Madera, M.; Vogel, C.; Chothia, C.; Gough, J. The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 2007, 35 (suppl_1), D308–313.
(10) Albeck, S.; Alzari, P.; Andreini, C.; Banci, L.; Berry, I. M.; Bertini, I.; Cambillau, C.; Canard, B.; Carter, L.; Cohen, S. X.; Diprose, J. M.; Dym, O.; Esnouf, R. M.; Felder, C.; Ferron, F.; Guillemot, F.; Hamer, R.; Ben Jelloul, M.; Laskowski, R. A.; Laurent, T.; Longhi, S.; Lopez, R.; Luchinat, C.; Malet, H.; Mochel, T.; Morris, R. J.; Moulinier, L.; Oinn, T.; Pajon, A.; Peleg, Y.; Perrakis, A.; Poch, O.; Prilusky, J.; Rachedi, A.; Ripp, R.; Rosato, A.; Silman, I.; Stuart, D. I.; Sussman, J. L.; Thierry, J. C.; Thompson, J. D.; Thornton, J. M.; Unger, T.; Vaughan, B.; Vranken, W.; Watson, J. D.; Whamond, G.; Henrick, K. SPINE bioinformatics and data-management aspects of high-throughput structural biology. Acta Crystallogr., Sect. D 2006, 62 (Pt 10), 1184–95.
(11) Bairoch, A.; Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28 (1), 45–48.
(12) Peterson, J. D.; Umayam, L. A.; Dickinson, E. K.; Hickey, E. K.; White, O. The Comprehensive Microbial Resource. Nucleic Acids Res. 2001, 29, 123–125.
(13) Baross, A.; Butterfield, Y. S. N.; Coughlin, S. M.; Zeng, T.; Griffith, M.; Griffith, O. L.; Petrescu, A. S.; Smailus, D. E.; Khattra, J.; McDonald, H. L.; McKay, S. J.; Moksa, M.; Holt, R. A.; Marra, M. A. Systematic Recovery and Analysis of Full-ORF Human cDNA Clones. Genome Res. 2004, 14 (10b), 2083–2092.
(14) Tatusov, R. L.; Fedorova, N. D.; Jackson, J. D. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4, 41.
(15) Letunic, I.; Copley, R. R.; Schmidt, S.; Ciccarelli, F. D.; Doerks, T.; Schultz, J.; Ponting, C. P.; Bork, P. SMART 4.0: towards genomic data integration. Nucleic Acids Res. 2004, 32 (suppl_1), D142–144.
(16) Delorenzi, M.; Speed, T. An HMM model for coiled-coil domains and a comparison with PSSM-based predictions. Bioinformatics 2002, 18, 617–625.
(17) Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Birney, E.; Biswas, M.; Bucher, P.; Cerutti, L.; Corpet, F.; Croning, M. D. R.; Durbin, R.; Falquet, L.; Fleischmann, W.; Gouzy, J.; Hermjakob, H.; Hulo, N.; Jonassen, I.; Kahn, D.; Kanapin, A.; Karavidopoulou, Y.; Lopez, R.; Marx, B.; Mulder, N. J.; Oinn, T. M.; Pagni, M.; Servant, F.; Sigrist, C. J. A.; Zdobnov, E. M. InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 2000, 16 (12), 1145–1150.
(18) Zdobnov, E. M.; Apweiler, R. InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17 (9), 847–848.
(19) Attwood, T. K.; Bradley, P.; Flower, D. R.; Gaulton, A.; Maudling, N.; Mitchell, A. L.; Moulton, G.; Nordle, A.; Paine, K.; Taylor, P.; Uddin, A.; Zygouri, C. PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 2003, 31 (1), 400–402.
(20) Xenarios, I.; Salwinski, L.; Duan, X. J.; Higney, P.; Kim, S.-M.; Eisenberg, D. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30 (1), 303–305.
(21) von Mering, C.; Zdobnov, E. M.; Tsoka, S.; Ciccarelli, F. D.; Pereira-Leal, J. B.; Ouzounis, C. A.; Bork, P. Genome evolution reveals biochemical networks and functional modules. Proc. Natl. Acad. Sci. U.S.A. 2003, 100 (26), 15428–15433.
(22) Lo Conte, L.; Brenner, S. E.; Hubbard, T. J. P.; Chothia, C.; Murzin, A. G. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 2002, 30 (1), 264–267.
(23) Pearl, F. M. G.; Bennett, C. F.; Bray, J. E.; Harrison, A. P.; Martin, N.; Shepherd, A.; Sillitoe, I.; Thornton, J.; Orengo, C. A. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res. 2003, 31 (1), 452–455.
(24) Getz, G.; Starovolsky, A.; Domany, E. F2CS: FSSP to CATH and SCOP prediction server. Bioinformatics 2004, 20 (13), 2150–2152.
(25) Studier, W. F. Protein Expression Purif. 2005, 41, 207–234.
(26) Mossessova, E.; Lima, C. D. Ulp1-SUMO crystal structure and genetic analysis reveal conserved interactions and a regulatory element essential for cell growth in yeast. Mol. Cell 2000, 5, 865–876.

(27) Cusick, M. E.; Klitgord, N.; Vidal, M.; Hill, D. E. Interactome: gateway into systems biology. Hum. Mol. Genet. 2005, ddi335.
(28) Carter, C. W., Jr.; Carter, C. W. J. Biol. Chem. 1979, 254 (23), 12219–23.
(29) Jancarik, J.; Kim, S.-H. J. Appl. Crystallogr. 1991, 24, 409–411.
(30) Li, F.; Robinson, H.; Yeung, E. S. Automated high-throughput nanoliter-scale protein crystallization screening. Anal. Bioanal. Chem. 2005, 383 (7–8), 1034–41.
(31) Fenglei, L.; Howard, R.; Edward, S. Y. Automated high-throughput nanoliter-scale protein crystallization screening. Anal. Bioanal. Chem. 2005, V383 (7), 1034.
(32) DeLucas, L. J.; Hamrick, D.; Cosenza, L.; Nagy, L.; McCombs, D.; Bray, T.; Chait, A.; Stoops, B.; Belgovskiy, A.; William Wilson, W. Protein crystallization: virtual screening and optimization. Prog. Biophys. Mol. Biol. 2005, 88 (3), 285.
(33) Ng, J. D.; Gavira, J. A.; Garcia-Ruiz, J. M. Protein crystallization by capillary counterdiffusion for applied crystallographic structure determination. J. Struct. Biol. 2003, 142 (1), 218–31.
(34) Gavira, J. A.; Toh, D.; Lopez-Jaramillo, J.; Garcia-Ruiz, J. M.; Ng, J. D. Ab initio crystallographic structure determination of insulin from protein to electron density without crystal handling. Acta Crystallogr., Sect. D 2002, 58 (Pt 7), 1147–54.
(35) Garcia-Ruiz, J. M. Counterdiffusion Methods for Macromolecular Crystallization. In Macromolecular Crystallography, Part C; Carter, C., Jr., Ed.; Academic Press: New York, 2003; Vol. 368, p 130.
(36) Garcia-Ruiz, J. M.; Ng, J. D. Counterdiffusion Capillary Crystallization for High Throughput Applications. In Protein Crystallization Strategies for Structural Genomics; Chayen, N., Ed.; International University Line: La Jolla, CA, 2006; Chapter 5.
(37) Ng, J. D. Space-grown protein crystals are more useful for structure determination. Ann. N.Y. Acad. Sci. 2002, 974, 598–609.
(38) Sauter, C.; Otalora, F.; Gavira, J. A.; Vidal, O.; Giege, R.; Garcia-Ruiz, J. M. Structure of tetragonal hen egg-white lysozyme at 0.94 Å from crystals grown by the counter-diffusion method. Acta Crystallogr., Sect. D 2001, 57 (Pt 8), 1119–26.
(39) Snell, G.; Cork, C.; Nordmeyer, R.; Cornell, E.; Meigs, G.; Yegian, D.; Jaklevic, J.; Jin, J.; Stevens, R. C.; Earnest, T. Automated sample mounting and alignment system for biological crystallography at a synchrotron source. Structure 2004, 12 (4), 537–45.
(40) Cork, C.; O’Neill, J.; Taylor, J.; Earnest, T. Advanced beamline automation for biological crystallography experiments. Acta Crystallogr., Sect. D 2006, 62 (Pt 8), 852–8.
(41) Perutz, M. F.; Rossmann, M. G.; Cullis, A. F.; Muirhead, H.; Will, G.; North, A. C. T. Structure of haemoglobin. A three-dimensional Fourier synthesis at 5.5 Å resolution, obtained by X-ray analysis. Nature 1960, 185, 416–422.
(42) Terwilliger, T. C. SOLVE and RESOLVE: Automated Structure Solution and Density Modification. In Methods in Enzymology; Carter, C. W., Jr., Sweet, R. M., Eds.; Academic Press: New York, 2003; Vol. 374, p 22.
(43) Morris, R. J.; Perrakis, A.; Lamzin, V. S. ARP/wARP and Automatic Interpretation of Protein Electron Density Maps. In Methods in Enzymology; Carter, C. W., Jr., Sweet, R. M., Eds.; Academic Press: New York, 2003; Vol. 374, p 229.
(44) Murshudov, G. N.; Vagin, A. A.; Dodson, E. J. Refinement of Macromolecular Structures by the Maximum-Likelihood Method. Acta Crystallogr., Sect. D 1997, 53 (3), 240–255.
(45) Wang, B.-C. Resolution of phase ambiguity in macromolecular crystallography. Methods Enzymol. 1985, 115, 90–112.
(46) Cowtan, K. D.; Zhang, K. Y. J. Density modification for macromolecular phase improvement. Prog. Biophys. Mol. Biol. 1999, 72 (3), 245.
(47) Navaza, J. Implementation of molecular replacement in AMoRe. Acta Crystallogr., Sect. D 2001, 57 (10), 1367–1372.
(48) McCoy, A. Solving structures of protein complexes by molecular replacement with Phaser. Acta Crystallogr., Sect. D 2007, 63 (1), 32–41.
(49) Liu, Z. J.; Tempel, W.; Ng, J. D.; Lin, D.; Shah, A. K.; Chen, L.; Horanyi, P. S.; Habel, J. E.; Kataeva, I. A.; Xu, H.; Yang, H.; Chang, J. C.; Huang, L.; Chang, S. H.; Zhou, W.; Lee, D.; Praissman, J. L.; Zhang, H.; Newton, M. G.; Rose, J. P.; Richardson, J. S.; Richardson, D. C.; Wang, B. C. The high-throughput protein-to-structure pipeline at SECSG. Acta Crystallogr., Sect. D 2005, 61 (Pt 6), 679–84.
(50) Fu, Z.-Q.; Rose, J.; Wang, B.-C. SGXPro: a parallel workflow engine enabling optimization of program performance and automation of structure determination. Acta Crystallogr., Sect. D 2005, 61 (7), 951–959.
(51) Suttle, C. A. Viruses in the sea. Nature 2005, 437 (7057), 356.

(52) Campanacci, V.; Egloff, M. P.; Longhi, S.; Ferron, F.; Rancurel, C.; Salomoni, A.; Durousseau, C.; Tocque, F.; Bremond, N.; Dobbe, J. C.; Snijder, E. J.; Canard, B.; Cambillau, C. Structural genomics of the SARS coronavirus: cloning, expression, crystallization and preliminary crystallographic study of the Nsp9 protein. Acta Crystallogr., Sect. D 2003, 59 (Pt 9), 1628–31.
(53) Terwilliger, T. C.; Park, M. S.; Waldo, G. S.; Berendzen, J.; Hung, L. W.; Kim, C. Y.; Smith, C. V.; Sacchettini, J. C.; Bellinzoni, M.; Bossi, R.; De Rossi, E.; Mattevi, A.; Milano, A.; Riccardi, G.; Rizzi, M.; Roberts, M. M.; Coker, A. R.; Fossati, G.; Mascagni, P.; Coates, A. R.; Wood, S. P.; Goulding, C. W.; Apostol, M. I.; Anderson, D. H.; Gill, H. S.; Eisenberg, D. S.; Taneja, B.; Mande, S.; Pohl, E.; Lamzin, V.; Tucker, P.; Wilmanns, M.; Colovos, C.; Meyer-Klaucke, W.; Munro, A. W.; McLean, K. J.; Marshall, K. R.; Leys, D.; Yang, J. K.; Yoon, H. J.; Lee, B. I.; Lee, M. G.; Kwak, J. E.; Han, B. W.; Lee, J. Y.; Baek, S. H.; Suh, S. W.; Komen, M. M.; Arcus, V. L.; Baker, E. N.; Lott, J. S.; Jacobs, W., Jr.; Alber, T.; Rupp, B. The TB structural genomics consortium: a resource for Mycobacterium tuberculosis biology. Tuberculosis 2003, 83 (4), 223–49.
(54) Eisenstein, E.; Gilliland, G. L.; Herzberg, O.; Moult, J.; Orban, J.; Poljak, R. J.; Banerjei, L.; Richardson, D.; Howard, A. J. Biological function made crystal clear - annotation of hypothetical proteins via structural genomics. Curr. Opin. Biotechnol. 2000, 11 (1), 25–30.
(55) Wang, B. C.; Adams, M. W.; Dailey, H.; DeLucas, L.; Luo, M.; Rose, J.; Bunzel, R.; Dailey, T.; Habel, J.; Horanyi, P.; Jenney, F. E., Jr.; Kataeva, I.; Lee, H. S.; Li, S.; Li, T.; Lin, D.; Liu, Z. J.; Luan, C. H.; Mayer, M.; Nagy, L.; Newton, M. G.; Ng, J.; Poole, F. L.; Shah, A.; Shah, C.; Sugar, F. J.; Xu, H. Protein production and crystallization at SECSG - an overview. J. Struct. Funct. Genomics 2005, 6 (2–3), 233–43.
(56) Robinson-Rechavi, M.; Godzik, A. Structural genomics of Thermotoga maritima proteins shows that contact order is a major determinant of protein thermostability. Structure 2005, 13 (6), 857–60.
(57) Panchenko, A. R.; Madej, T. Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evol. Biol. 2005, 5 (1), 10.
(58) Hughes, D. Exploiting genomics, genetics and chemistry to combat antibiotic resistance. Nat. Rev. Genet. 2003, 4 (6), 432–41.
(59) Au, K.; Berrow, N. S.; Blagova, E.; Boucher, I. W.; Boyle, M. P.; Brannigan, J. A.; Carter, L. G.; Dierks, T.; Folkers, G.; Grenha, R.; Harlos, K.; Kaptein, R.; Kalliomaa, A. K.; Levdikov, V. M.; Meier, C.; Milioti, N.; Moroz, O.; Muller, A.; Owens, R. J.; Rzechorzek, N.; Sainsbury, S.; Stuart, D. I.; Walter, T. S.; Waterman, D. G.; Wilkinson, A. J.; Wilson, K. S.; Zaccai, N.; Esnouf, R. M.; Fogg, M. J. Application of high-throughput technologies to a structural proteomics-type analysis of Bacillus anthracis. Acta Crystallogr., Sect. D 2006, 62 (Pt 10), 1267–75.
(60) Van Domselaar, G. H.; Stothard, P.; Shrivastava, S.; Cruz, J. A.; Guo, A.; Dong, X.; Lu, P.; Szafron, D.; Greiner, R.; Wishart, D. S. BASys: a web server for automated bacterial genome annotation. Nucleic Acids Res. 2005, 33 (Web Server issue), W455–9.
(61) Rodrigues, A.; Hubbard, R. E. Making decisions for structural genomics. Brief Bioinformatics 2003, 4 (2), 150–67.
(62) Chen, B. Y.; Janes, H. W.; Chen, S. Computer programs for PCR primer design and analysis. Methods Mol. Biol. 2002, 192, 19–29.
(63) Prilusky, J.; Oueillet, E.; Ulryck, N.; Pajon, A.; Bernauer, J.; Krimm, I.; Quevillon-Cheruel, S.; Leulliot, N.; Graille, M.; Liger, D.; Tresaugues, L.; Sussman, J. L.; Janin, J.; van Tilbeurgh, H.; Poupon, A. HalX: an open-source LIMS (Laboratory Information Management System) for small- to large-scale laboratories. Acta Crystallogr., Sect. D 2005, 61 (Pt 6), 671–8.

(64) Pikuta, E.; Marsic, D.; Itoh, T.; Bej, A. K.; Tang, J.; Whitman, W.; Ng, J. D.; Garriott, O. K.; Hoover, R. B. Thermococcus thioreducens sp. nov., a novel hyperthermophilic, obligately sulfur-reducing archaeon from a deep-sea hydrothermal vent. J. Syst. Evol. Microbiol. 2007, in press.
(65) Goulding, C. W.; Jeanne Perry, L. Protein production in Escherichia coli for structural studies by X-ray crystallography. J. Struct. Biol. 2003, 142 (1), 133.
(66) Segelke, B.; Schafer, J.; Coleman, M.; Lekin, T.; Toppani, D.; Skowronek, K.; Kantardjieff, K.; Rupp, B. Laboratory scale structural genomics. J. Struct. Funct. Genomics 2004, 5 (1), 147.
(67) Zhao, Q.; Frederick, R.; Seder, K.; Thao, S.; Sreenath, H.; Peterson, F.; Volkman, B. F.; Markley, J. L.; Fox, B. G. Production in two-liter beverage bottles of proteins for NMR structure determination labeled with either 15N- or 13C-15N. J. Struct. Funct. Genomics 2004, 5 (1–2), 87–93.
(68) McPherson, A. Crystallization of Biological Macromolecules. 1999.
(69) McPherson, A.; Cudney, B. Searching for silver bullets: An alternative strategy for crystallizing macromolecules. J. Struct. Biol. 2006, 156 (3), 387.
(70) Newman, J. Expanding screening space through the use of alternative reservoirs in vapor-diffusion. Acta Crystallogr., Sect. D 2005, 61, 490–493.
(71) Biertümpfel, C.; Basquin, J.; Suck, D. Practical implementations for improving the throughput in a manual crystallization setup. J. Appl. Crystallogr. 2005, 38, 568–570.
(72) Forsythe, E.; Achari, A.; Pusey, M. L. Trace fluorescent labeling for high-throughput crystallography. Acta Crystallogr., Sect. D 2006, 62 (3), 339–346.
(73) Groves, M. R.; Muller, I. B.; Kreplin, X.; Muller-Dieckmann, J. A method for the general identification of protein crystals in crystallization experiments using a noncovalent fluorescent dye. Acta Crystallogr., Sect. D 2007, 63 (4), 526–535.
(74) Watanabe, N.; Kitago, Y.; Tanaka, I.; Wang, J.-w.; Gu, Y.-x.; Zheng, C.-d.; Fan, H.-f. Comparison of phasing methods for sulfur-SAD using in-house chromium radiation: case studies for standard proteins and a 69 kDa protein. Acta Crystallogr., Sect. D 2005, 61 (11), 1533–1540.
(75) Yang, C.; Pflugrath, J. W.; Courville, D. A.; Stence, C. N.; Ferrara, J. D. Away from the edge: SAD phasing from the sulfur anomalous signal measured in-house with chromium radiation. Acta Crystallogr., Sect. D 2003, 59 (11), 1943–1957.
(76) Xu, H.; Yang, C.; Chen, L.; Kataeva, I. A.; Tempel, W.; Lee, D.; Habel, J. E.; Nguyen, D.; Pflugrath, J. W.; Ferrara, J. D.; Arendall, W. B., III; Richardson, J. S.; Richardson, D. C.; Liu, Z.-J.; Newton, M. G.; Rose, J. P.; Wang, B.-C. Away from the edge II: in-house SeSAS phasing with chromium radiation. Acta Crystallogr., Sect. D 2005, 61 (7), 960–966.
(77) Ito, T.; Chiba, T.; Ozawa, R.; Yoshida, M.; Hattori, M.; Sakaki, Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. U.S.A. 2001, 98 (8), 4569–4574.
(78) Uetz, P.; Giot, L.; Cagney, G.; Mansfield, T. A.; Judson, R. S.; Knight, J. R.; Lockshon, D.; Narayan, V.; Srinivasan, M.; Pochart, P.; Qureshi-Emili, A.; Li, Y.; Godwin, B.; Conover, D.; Kalbfleisch, T.; Vijayadamodar, G.; Yang, M.; Johnston, M.; Fields, S.; Rothberg, J. M. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403 (6770), 623.
(79) Kumar, A.; Snyder, M. Proteomics: Protein complexes take the bait. Nature 2002, 415 (6868), 123.
