Structural Analysis and Modeling of Proteins on the Web: An Investigation for Biochemistry Undergraduates

Jun 6, 1998
In the Classroom

Teaching with Technology
edited by James P. Birk, Arizona State University, Tempe, AZ 85287

Structural Analysis and Modeling of Proteins on the Web
An Investigation for Biochemistry Undergraduates

Darryl León,* Sarah Uridil, and James Miranda
Department of Chemistry and Biochemistry, California Polytechnic State University, San Luis Obispo, CA 93407

Teaching advanced biochemistry methods at undergraduate universities can prepare students for their anticipated careers as biochemists. Typically, advanced techniques such as PCR, DNA sequencing, Western blotting, and immunochemistry are introduced as standard methods for analyzing nucleic acids and proteins. One area currently being explored is how to incorporate molecular modeling and related analyses into advanced chemistry (1) and biochemistry classes. Researchers use a variety of software packages for structural analysis of protein and nucleic acid sequences. However, these packages may be too complicated for undergraduates to learn within a few weeks, and some require an expensive graphics workstation to be useful. One way to avoid the cost of these workstations and stay within limited departmental funds is to access sequence databases and structural analysis tools via the World Wide Web (Web). The Web is an excellent resource with which college instructors can help students analyze nucleic acid and protein data. It also gives students practice with the wide range of servers and databanks that remain available to them after they graduate. This approach has been used successfully in teaching an advanced biochemistry course (protein structure and folding) and was investigated further as an undergraduate research project.

*Corresponding author.

Methods

The instructor showed the students how to access the Internet and how to use the World Wide Web as a resource for protein analysis. The software chosen for browsing the Web was Netscape Navigator (v 2.0), and the platform selected was a Power Macintosh (Apple) computer. After the students were introduced to the home page of the university and were shown how to use the Web browser, they were given the Internet addresses of various sequence databases and structural analysis servers. Although a variety of structural analyses are available, this group consistently used only a few to study the protein sequence in question. Specifically, analytical resources for sequence homology searching, sequence alignment, secondary structure prediction, and tertiary structure prediction were evaluated and compared on the basis of their usefulness. The online resources that proved practicable, with their descriptions and Web addresses, are listed in Table 1. The order in which the analyses are performed is important if the student is to learn how to investigate a protein's structure or function.

Table 1. Partial List of Protein Servers on the World Wide Web

Server/Software (a) | Description | Web Address (URL)
PDB | Database of experimentally determined protein and nucleic acid structures compiled at Brookhaven National Laboratory | http://www.pdb.bnl.gov/
SWISS-PROT | Database of protein sequences | http://expasy.hcuge.ch/www/tools.html
PIR | Database of protein sequences | http://www.gdb.org/Dan/proteins/pir.html
BioSCAN | A rapid search and sequence analysis | http://genome.cs.unc.edu/bioscan.html
BLITZ | Provides specific sequences most similar to or containing the most similar regions to a query sequence | http://www.ebi.ac.uk/searches/blitz_input.html
NNPREDICT | Predicts secondary structure | http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
PHD | Automatic service for predicting a protein structure | http://www.embl-heidelberg.de/predictprotein/predictprotein.html
SWISS-MODEL (ProMod) | Experimental protein modeling server at the Glaxo Institute | http://expasy.hcuge.ch/swissmod/SWISS-MODEL.html
Rasmac (Mac version) | Molecular graphics program intended for visualization of proteins and related molecules | http://www.umass.edu/microbio/rasmol/

(a) PDB = Protein Data Bank; SWISS-PROT = Swiss Protein; PIR = Protein Identification Resource; BioSCAN = Biological Sequence Comparative Analysis Node; PHD = PredictProtein.

JChemEd.chem.wisc.edu • Vol. 75 No. 6 June 1998 • Journal of Chemical Education

Steps for Protein Structure Analysis

An instructor can guide a student through a protein analysis by following the flowchart shown in Figure 1. The instructor can either allow a student to select a protein of his or her own choice or compile a short list of common proteins from which the student selects. The student first searches for the protein in the Protein Data Bank (2, 3) to determine whether its 3-D structure is known. If the structure is known, the coordinates should be saved as a text file. Next, the protein sequence is located in either the PIR or the SWISS-PROT databank and also saved as a text file. The student then submits the entire protein sequence to a server that searches for other proteins with a similar amino acid sequence. The homologous or most closely related proteins are usually the first to be identified; however, unrelated protein fragments are also listed for completeness.

The student tries to find a polypeptide sequence that is a novel protein or protein fragment; it is important that the novel protein be a newly sequenced polypeptide because the structure of such a protein is usually not known. Whether the protein is in fact novel can be assessed from the date of sequence determination and deposition recorded in the selected protein databank. Even short sequences of 20–40 residues may prove interesting to a student. If a novel protein sequence is not identified, the student must start the flowchart again from Step 1 (Fig. 1).

If an appropriate novel polypeptide was selected and the homology search did not perform an alignment automatically, the student can submit the novel polypeptide sequence and the original protein sequence to an alignment server. Having received a complete alignment, the student submits the new sequence and the corresponding original sequence to a secondary structure prediction server. Using a variety of secondary structure algorithms, students can thus speculate about the types of structural motifs found in the selected protein. If a variety of algorithms show both sequences to have similar secondary structure regions, there is a high probability that their tertiary structures will also be related.

The student submits the novel polypeptide sequence to SWISS-MODEL, which uses the structure–relationship method for predicting 3-D structures of unknown peptide sequences. (The server also allows a user to identify a protein with a known structure to be used as a comparison for the novel protein that is being modeled.) If the computer analysis finds a correlation for the chosen sequence, the server automatically emails the coordinates of the predicted model to the student in Protein Data Bank (PDB) file format.
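The decision points described in this section can be summarized in a few lines of Python. This is only an illustrative sketch: the names `Answers`, `plan_analysis`, and `RESTART` are hypothetical, and the sketch assumes that every "No" outcome sends the student back to Step 1, which the text states explicitly only for the novel-sequence check.

```python
from dataclasses import dataclass
from typing import List

# Assumption: every "No" branch returns the student to Step 1.
RESTART = "return to Step 1 and select another protein"


@dataclass
class Answers:
    """Yes/no outcomes a student records while working through the analysis."""
    structure_known: bool              # is the 3-D structure in the PDB?
    novel_sequence_found: bool         # did the homology search list a new polypeptide?
    similar_secondary_structure: bool  # do the predictions for both sequences agree?
    model_returned: bool               # did the modeling server email coordinates?


def plan_analysis(answers: Answers) -> List[str]:
    """Return the ordered actions implied by the answers, stopping at the first 'No'."""
    steps = ["search the PDB for the selected protein"]
    if not answers.structure_known:
        return steps + [RESTART]
    steps += [
        "save the PDB coordinates as a text file",
        "locate the sequence in PIR or SWISS-PROT and save it as a text file",
        "submit the sequence to a homology search server",
    ]
    if not answers.novel_sequence_found:
        return steps + [RESTART]
    steps.append("submit both sequences for secondary structure prediction")
    if not answers.similar_secondary_structure:
        return steps + [RESTART]
    steps.append("submit the novel sequence to SWISS-MODEL")
    if not answers.model_returned:
        return steps + [RESTART]
    steps.append("visualize and compare the predicted and experimental structures")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(plan_analysis(Answers(True, True, True, True)), 1):
        print(number, step)
```

Writing the path as an ordered list makes it easy for a student to record where an analysis stopped and why.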
The coordinates of the predicted model of the novel protein and of the experimentally determined protein can be visualized using Rasmac (Apple Macintosh platform) or Rasmol (IBM-compatible platform) software (4). Two other free visualization packages available to students from the Web are Chemscape Chime from MDL Information Systems (http://www.mdli.com/chemscape/chime/) and WebLab Viewer from MSI (http://www.msi.com/weblab/viewer/index.htm). All these software packages and platforms can be operated easily by students. Having displayed the coordinates, students then compare the two structures in terms of overall shape, secondary structure motifs, and structural domains. The Rasmac software reads the PDB text file and produces a three-dimensional structure that can be rotated freely and manipulated using text commands. The software allows the student to switch among wire-frame, ribbon, and space-filling representations in a matter of seconds or minutes, depending on the size of the file and the speed of the computer. The student can print the molecular models in color using a color ink-jet or laser printer, and the desired structures can be saved in PICT, GIF, and other formats. Additionally, the molecular models can be copied and pasted into word processing programs such as Microsoft Word (v 5.0 or higher).
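Because a PDB coordinate file is plain text, its contents can also be inspected outside a viewer. The minimal Python sketch below extracts the Cα atoms (the atoms a backbone display connects) from the fixed-column ATOM records of a PDB-format file. The names `CaAtom`, `read_ca_atoms`, and `format_atom_line` are hypothetical helpers introduced for illustration; the column positions follow the published PDB file format, and only the first model of a multi-model (for example, NMR) file is read.

```python
from typing import List, NamedTuple


class CaAtom(NamedTuple):
    res_name: str  # three-letter residue name, e.g. "MET"
    res_seq: int   # residue sequence number
    x: float
    y: float
    z: float


def read_ca_atoms(pdb_text: str) -> List[CaAtom]:
    """Collect the C-alpha atoms from the first model in PDB-format text."""
    atoms = []
    for line in pdb_text.splitlines():
        # ATOM records use fixed columns; the atom name occupies columns 13-16.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atoms.append(CaAtom(
                res_name=line[17:20].strip(),  # columns 18-20
                res_seq=int(line[22:26]),      # columns 23-26
                x=float(line[30:38]),          # columns 31-38
                y=float(line[38:46]),          # columns 39-46
                z=float(line[46:54]),          # columns 47-54
            ))
        elif line.startswith("ENDMDL"):
            break  # NMR entries hold several models; keep only the first
    return atoms


def format_atom_line(serial: int, name: str, res: str, chain: str,
                     seq: int, x: float, y: float, z: float) -> str:
    """Build one column-aligned ATOM record (used here only to make test input)."""
    return (f"ATOM  {serial:5d} {name:<4} {res:>3} {chain}{seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")
```

A typical use would be `read_ca_atoms(open("model.pdb").read())`, where `model.pdb` stands for whatever file name the student chose when saving the coordinates.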

[Figure 1: flowchart. Boxes: Select a protein and search the Protein Data Bank (PDB). Is the protein structure known? Locate coordinates and save as a text file. Is the protein found in the PIR or Swiss Protein databanks? Submit protein to PIR or Swiss Protein. Locate sequence and save as text file. Access and submit sequence of desired protein for homology search. Is one of the aligned sequences a new or novel polypeptide? Submit sequence from original and novel polypeptide for secondary structure prediction. Do aligned sequences have similar predicted secondary structures? Submit amino acid sequence of novel polypeptide to SwissModel. Was there a result? Download the predicted coordinates, access Rasmol, and open the text files of the new model and of the experimentally determined structure of the original protein. Are the structures similar? Compare the structures.]
Figure 1. Example flowchart of how an analysis of a novel protein is performed.

Structure Analysis and Modeling of Proteins

The approach described above was used successfully to predict the 3-D structures of several different protein fragments. The results of one of these analyses are discussed here. Based on sequence alignments, the DNA-binding regulatory protein called the lactose operon repressor was found to be similar to the ribose operon repressor. (Fortuitously, the NMR structure of the lactose operon repressor had recently been determined [5, 6].) One of the major structural motifs of DNA-binding regulatory proteins is the helix-turn-helix motif (7), a structural domain consisting of two successive α helices separated by a sharp β turn (8). Because the three-dimensional structure of the Escherichia coli ribose operon repressor has not yet been determined, the amino-terminal DNA-binding domain of the ribose operon repressor (9) was modeled after the lactose operon repressor (6). To accelerate the analysis, a student was given the names of these two proteins, and the student's analysis began at this point.

Structural analysis began with the student acquiring the amino acid sequences of both the ribose and the lactose operon repressor proteins from the PIR (10) and Swiss Protein (11) servers. A simple homology comparison was then carried out by submitting the first 100 amino acids of each protein to BioSCAN (12) and BLITZ (13). The greatest alignment occurred in the first 62 amino acids of the ribose operon repressor protein (Fig. 2). Because the first 62 amino acids aligned better (>50%) than the last 38 residues, these amino acids of the ribose operon repressor were used for the remainder of the structural analysis. The first 62 amino acid residues of both proteins were then submitted to NNPREDICT (14) and PHD (15) for analysis of secondary structure. Most of the α-helices and β-strands overlapped, suggesting that the proteins would likely have similar secondary structure.

[Figure 2: (A) BLITZ alignment of the ribose and lactose operon repressors, residues 1–100, with match markers. (B) PHD secondary structure prediction for the ribose operon repressor. Ribose operon repressor residues shown in the figure: ATMKDVARLA GVSTSTVSHV INKDRFVSEA ITAKVEAAIK ELNYAPSALA RSLKLNQTHT IGMLITASTN PFYSELVRGV ERSCFERGYS LVLCNTEGDE]
Figure 2. Sequence homology and structural prediction homology between the ribose and lactose operon repressors. (A) The first 100 amino acids from the ribose and lactose operon repressors were submitted to the BLITZ server. Homologous residues are indicated by a “*”; a partial match is indicated by a “~”. (B) Amino acid sequence (1–65) from the ribose operon repressor protein was submitted to the PHD server; the resulting secondary structure is indicated with symbols where H = helix, L = loop, and E = strand.

A student applied the protein modeling tool SWISS-MODEL (16) to obtain the predicted coordinates of the ribose operon repressor. SWISS-MODEL modeled the submitted sequence on three proteins of known structure and returned predicted coordinates for the N-terminus of the ribose operon repressor. The coordinates for the NMR structure of the lactose operon repressor protein were obtained from the PDB and saved as a text file. The N-termini of the lactose operon repressor NMR structure and the predicted ribose operon repressor structure are shown as Cα backbone diagrams (Fig. 3). The student visualized the coordinates using Rasmac (4) and saved the output as a PICT file. The final diagram was labeled and manipulated using Adobe Photoshop. The helix-turn-helix motif characteristic of DNA-binding proteins is easily seen in both models.

[Figure 3: two Cα-backbone diagrams titled "NMR Structure of Lactose Operon Repressor" and "Predicted Structure of Ribose Operon Repressor", with helix and turn labels and terminal residue labels MET 1, ARG 51, THR 1, and THR 59.]
Figure 3. Comparison of Cα-backbone diagrams of the lactose operon repressor protein and the ribose operon repressor protein. The N- and C-termini are designated with residue numbers, and the significant secondary structures are labeled accordingly.
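The two comparisons made in this analysis, how well a stretch of the alignment matches and how far two secondary structure predictions overlap, can be put into numbers with a short script. The following Python sketch is an illustration only: the function names are hypothetical, the marker convention ("*" for a match, "~" for a partial match, one character per alignment column) is taken from the Figure 2 caption, and the strings in the usage note are toy values, not the repressor data.

```python
from typing import Dict


def marker_summary(markers: str) -> Dict[str, float]:
    """Summarize an alignment marker line: '*' = match, '~' = partial match.

    The line is assumed to hold one character per alignment column,
    with a space where the residues do not match.
    """
    columns = len(markers)
    if columns == 0:
        raise ValueError("marker line is empty")
    matches = markers.count("*")
    partial = markers.count("~")
    return {
        "match_fraction": matches / columns,
        "match_or_partial_fraction": (matches + partial) / columns,
    }


def prediction_agreement(pred_a: str, pred_b: str) -> float:
    """Fraction of positions where two secondary structure strings agree.

    Each string uses one symbol per residue (H = helix, L = loop, E = strand).
    """
    if len(pred_a) != len(pred_b) or not pred_a:
        raise ValueError("predictions must be non-empty and of equal length")
    same = sum(1 for a, b in zip(pred_a, pred_b) if a == b)
    return same / len(pred_a)


if __name__ == "__main__":
    # Toy inputs for illustration only.
    print(marker_summary("**~ *"))
    print(prediction_agreement("HHHHLLEE", "HHHLLLEE"))
```

Applying `marker_summary` separately to the marker columns for residues 1–62 and 63–100 would reproduce the kind of comparison used to choose the better-aligned region, and `prediction_agreement` could be run on the NNPREDICT and PHD output strings for the same residues.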

Summary

Only approximately 10% of all known proteins have an experimentally determined tertiary structure. A relatively new process known as automated protein modeling (available via the World Wide Web) is an option for those who wish to determine protein tertiary structure yet lack expensive workstations and modeling software. Presented here was an example in which a student predicted the tertiary structure of a fragment of a DNA-binding protein after first acquiring the skills needed to use the resources available on the Internet. This paper has illustrated that the many undergraduates who do not have access to complex and usually very expensive modeling tools can still learn, or at least gain some experience in, protein structure determination by computer. Moreover, these skills are very useful to a student who is interested in bioinformatics or functional genomics. During a time of decreasing funds for scientific research at undergraduate institutions, a variety of methods for educating students in molecular modeling and computational protein analysis is necessary, and the World Wide Web is one such method. (An online tutorial for this modeling approach can be accessed at http://www.calpoly.edu/~dleon/Biochem.html.)

Acknowledgments

We would like to thank Naomi Devlin for proofreading and editing and Mary Rigler for contributing critical comments.

Literature Cited

1. Lipkowitz, K. B.; Pearl, G. M.; Robertson, D. H.; Schultz, F. A. J. Chem. Educ. 1996, 73, 105–107.
2. Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B.; Meyer, E. F., Jr.; Brice, M. D.; Rodgers, J. R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. J. Mol. Biol. 1977, 112, 535–542.
3. Peitsch, M. C. Biochem. Soc. Trans. 1996, 24, 274–279.
4. Sayle, R. RasMol v 2.5: A Molecular Visualisation Program; Glaxo Research and Development: Middlesex, UK, 1994.
5. Van Boom, J. H.; Boelens, R.; Kaptein, R. J. Mol. Biol. 1993, 234, 446.
6. Farabaugh, P. J. Nature 1978, 274, 765–769.
7. Branden, C.; Tooze, J. Introduction to Protein Structure; Garland: New York, 1991; p 87.
8. Lewis, M.; Chang, G.; Horton, N. C.; Kercher, M. A.; Pace, H. C.; Schumacher, M. A.; Brennan, R. G.; Lu, P. Science 1996, 271, 1247–1254.
9. Mauzy, C. A.; Hermodson, M. A. Protein Sci. 1992, 1, 843–849.
10. Protein Identification Resource (PIR); National Biomedical Research Foundation, 3900 Reservoir Road, N. W., Washington, DC 20007, USA.
11. Bairoch, A.; Boeckmann, B. Nucleic Acids Res. 1992, 20, 2019–2022.
12. Singh, R. K.; Tell, S. G.; White, C. T.; Hoffman, D.; Chi, V. L.; Erickson, B. W. Proceedings of the Symposium on Integrated Systems, Seattle, WA, Mar 1993; MIT Press: Cambridge, MA, 1993; pp 168–182.
13. Emmert, D. B.; Stoehr, P. J.; Stoesser, G.; Cameron, G. N. Nucleic Acids Res. 1994, 22, 3445–3449.
14. Kneller, D. G.; Cohen, F. E.; Langridge, R. J. Mol. Biol. 1990, 214, 171–182.
15. Rost, B.; Sander, C. Proc. Natl. Acad. Sci. USA 1993, 90, 7558–7562.
16. Abola, E. E.; Bernstein, F. C.; Bryant, S. H.; Koetzle, T. F.; Weng, J. In Crystallographic Databases—Informational Content, Software Systems, Scientific Applications; Allen, F. H.; Bergerhoff, G.; Sievers, R., Eds.; Data Commission of the International Union of Crystallography: Bonn/Cambridge/Chester, 1987; pp 107–132.