Biomacromolecules 2005, 6, 3152-3159
Analysis of the Conserved N-Terminal Domains in Major Ampullate Spider Silk Proteins
Dagmara Motriuk-Smith,*,† Alyson Smith,† Cheryl Y. Hayashi,‡ and Randolph V. Lewis†
University of Wyoming, Department of Molecular Biology, Laramie, Wyoming 82071, and University of California, Department of Biology, Riverside, California 92521
Received July 4, 2005; Revised Manuscript Received September 15, 2005
Major ampullate silk, also known as dragline silk, is one of the strongest biomaterials known. This silk is composed of two proteins, major ampullate spidroin 1 (MaSp1) and major ampullate spidroin 2 (MaSp2). Only partial cDNA sequences have been obtained for these proteins, and these sequences are toward the C-terminus. Thus, the N-terminal domains have never been characterized for either protein. Here we report the sequence of the N-terminal region of major ampullate silk proteins from three spider species: Argiope trifasciata, Latrodectus geometricus, and Nephila inaurata madagascariensis. The amino acid sequences are inferred from genomic DNA clones. Northern blotting experiments suggest that the predicted 5′ ends of the transcripts are present in fibroin mRNA. The presence of more than one Met codon in the N-terminal region indicates the possibility of translation of both a long and a short isoform. The size of the short isoform is consistent with the published, cDNA-based N-terminal sequence found in flagelliform silk. Analyses comparing the level of identity of all known spider silk N-termini show that the N-terminus is the most conserved part of silk proteins. Two DNA sequence motifs identified upstream of the putative transcription start site are potential silk fibroin promoter elements.
Introduction
Major ampullate silk is one of seven different types of silk spun by orb-weaving spiders. Major ampullate silk is crucial for the survival of spiders because of its dual role as a safety dragline and essential element of foraging webs. Peptide and cDNA sequencing reveal that this silk consists of at least two proteins: major ampullate spidroin 1 (MaSp1) and major ampullate spidroin 2 (MaSp2). MaSp proteins identified from several spider species share common features. They are very large proteins, ranging from 260 to 650 kDa1,2 and are predominantly composed of repetitive units. Each repetitive region is built from short amino acid sequence motifs. 
The MaSp1 protein has GGX, (GA)n, and (A)n motifs, while the MaSp2 protein has (A)n and GPGXX motifs (X represents a limited subset of amino acids). Like all members of the spider silk protein family, MaSp1 and MaSp2 end with distinct, nonrepetitive C-terminal regions. The C-terminal region is conserved in sequence and length even among phylogenetically distant species and among silk types.3-6 The majority of MaSp silk protein sequences are deduced from partial cDNA sequences. Full-length cDNA sequences have yet to be reported for any MaSp protein. Partial cDNA sequences of almost all silks lack the 5′ end of the sequence, with the notable exception of flagelliform silk.7 The identification of silk protein N-termini will provide a better understanding of the biochemistry of silk proteins. N-terminal sequences may also yield insight into fiber formation and the development of more efficient production of recombinant silk protein. Furthermore, the DNA and amino acid sequences of the N-termini can be used as new markers to identify spider silks and their evolutionary relationships. No published data have been reported for the N-terminal part of the major ampullate silk protein. Attempts to sequence the N-terminal portion of the protein using the Edman degradation method have been disappointing because the resulting fragments are usually from the repetitive part of the protein.2,8 Genomic DNA sequencing is one of the approaches that proved to be successful in identification of the 5′ end of a silk gene and, thus, the N-terminus of the flagelliform silk fibroin.9 In this study, the genomic sequences of the 5′ end of MaSp silk fibroin genes are reported from three spider species: Argiope trifasciata (At), Latrodectus geometricus (Lg), and Nephila inaurata madagascariensis (Nim). Deduced amino acid sequences allow for comparison of the N-terminal silk sequences among fibroin types and several spider species.
Materials and Methods
Genomic DNA Library Construction and Screening. The three spider species used for this study were A. trifasciata, L. geometricus, and N. inaurata madagascariensis. Cephalothoraxes, the genomic DNA source, were stored at -80 °C. Genomic DNA was isolated by homogenization of tissue in liquid nitrogen and extraction using a standard SDS-proteinase K method.10 Genomic libraries for each of the three species were constructed using the Lambda Fix II/Xho I partial fill-in vector kit (Stratagene, La Jolla, CA). The process of insert ligation, packaging, titering, and
* Corresponding author. Phone: 307-268-2542. Fax: 307-268-2416. E-mail: [email protected]. † University of Wyoming. ‡ University of California, Riverside.
10.1021/bm050472b CCC: $30.25 © 2005 American Chemical Society Published on Web 10/08/2005
Conservation of Spider Silk N-Termini
plaque lifts was performed according to the manufacturer’s instructions. Oligonucleotide probes (5′-GGW GWG CWG GAC AAG GAG G-3′ and 5′-CAA CAA GGA TAY GGA CCA GG-3′) specific to the repetitive regions of MaSp1 and MaSp2 fibroins were used to screen the Lg library. The Nim library was screened with the probe 5′-CCW CCW GGW CCN NNW CCW CCW GGW CC-3′. The probes were kinase labeled using [γ-32P] dATP (7000 Ci/mmol). The At library was screened with MaSp1 and MaSp2 cDNA fragments (AF350266 and AF350267) labeled using a random primer method.11 Positive recombinant clones were screened for genomic insert size, and the largest ones were selected for the subcloning into the pGEM-T Easy plasmid vector (Promega, Madison, WI). Nim and At silk positive clones were sequenced randomly until the 5′ end of the silk gene was found. Once the 5′ end of silk for Nim and At was identified, instead of random sequencing, two oligonucleotide probes (5′-GCW TTY GCT TCM TCR ATG GCM GAA-3′ and 5′-GCM YTT GCT TCY TCW ATR GCC GAA-3′) were designed to identify the presence of the 5′ end of silk gene in the Lg library. Nine Lg clones, previously identified to contain silk repeats, were subjected to a restriction digest. Once the inserts were cut out, the DNA was run on an agarose gel and subjected to a Southern blotting. Hybridization with QuikHyb (Stratagene, La Jolla, CA) was performed at 42 °C for 15 h. Washes were 15 min at RT (2× SSC, 0.1% SDS), 15 min at 42 °C (2× SSC, 0.1% SDS), and 7 min at 45 °C (0.1× SSC, 0.1% SDS). Polymerase Chain Reaction (PCR). Genomic DNA from At was isolated as described above. PCR primers 5′-TGG GAT GTT CAT TTA GGT TCT GG-3′ and 5′-CCA ACC AWT TGC GCA TAC TG-3′ were designed to amplify the 3′ end of the At.MaSp2 gene. PCR conditions were optimized for Taq polymerase recombinant (Invitrogen, Carlsbad, CA). Initial template denaturation was set for 3 min at 94 °C. 
Repeated cycles (35 times) had the following conditions: 1 min at 94 °C, 1 min at 55 °C, and 3 min 30 s at 72 °C. The final extension at 72 °C was set for 7 min. The PCR product was resolved on an agarose gel, and the largest band was excised for cloning into the TOPO XL vector (Invitrogen, Carlsbad, CA). Three recombinant clones were selected for sequencing and aligned, and the final DNA sequence was corrected for Taq polymerase errors by taking the majority rule consensus.
Sequence Analysis. Sequencing reactions were performed at the Brigham Young University DNA Sequencing Center (http://dnasc.byu.edu). Transposon EZ::TN (Epicentre, Madison, WI) insertion enabled internal sequencing of large genomic DNA inserts. Three sets of sequencing primers were used: T7 and M13R (TOPO XL vector), T7 and Sp6 (pGEM-T Easy vector), and TET-1 FP-1 and TET-1 RP-1 (Transposon EZ::TN). Sequencing text files were compared to chromatograms and were edited manually. Complete sequences were analyzed using MacVector software (Accelrys, San Diego, CA).
Northern Blots. Major ampullate glands were dissected from euthanized At and Lg spiders. Due to the unavailability of live Nim, Nephila clavipes (Nc) was used instead. The
Biomacromolecules, Vol. 6, No. 6, 2005 3153
close relationship between MaSp silk fibroins of these two congeneric species was previously determined.6 Prior to dissections, spiders were subjected to forcible silking to stimulate mRNA synthesis.12 Glands were harvested within 24-48 h after silking, immediately flash frozen in liquid nitrogen, and stored at -80 °C. Four to eight glands were homogenized in 1 mL of Tri reagent (Molecular Research Center, Inc., Cincinnati, OH), and total RNA was extracted. An amount of 10 µg of total RNA per lane was resolved on 0.7% agarose formaldehyde denaturing gel with 1× MOPS as a running buffer. The RNA ladder (New England Biolabs, Beverly, MA) was used as a size marker. The gel was briefly rinsed with water and blotted overnight onto a Hybond N+ membrane (Amersham, Piscataway, NJ) with 10× SSC as a transfer buffer. The membrane was momentarily rinsed with 6× SSC and baked at 80 °C for 2 h. The baked membrane was stained with methylene blue. The RNA marker was used to plot the log of a molecular size versus migration distance. The extension of the curve was used to estimate the molecular size of mRNA bands. The equal intensity of ribosomal RNA bands verified concentration uniformity of the loaded samples. Hybridization was performed at 42 °C using QuikHyb (Stratagene, La Jolla, CA). A set of six probes was designed for each species. The number following the species name refers to the predicted annealing location for each of the probes. No. 1 was designed to anneal to the most 5′ end location, upstream of the predicted transcription start site. Nos. 2-6 were designed to anneal to the region from +4 to +370 of the mRNA sequence, with 2 being the most upstream location and 6 the most downstream. 
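The size-estimation step described above (plotting the log of marker size against migration distance and extending the curve to the mRNA bands) can be sketched as a simple least-squares fit. This is only an illustration, assuming a linear relation between log10(size) and distance; the function names and the ladder values below are hypothetical and are not measurements from this study.

```python
import math


def fit_log_size(ladder):
    """Least-squares fit of log10(size_kb) = slope * distance_mm + intercept.

    `ladder` is a list of (migration_distance_mm, size_kb) pairs.
    """
    xs = [distance for distance, _ in ladder]
    ys = [math.log10(size) for _, size in ladder]
    n = len(ladder)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_size_kb(distance_mm, slope, intercept):
    """Read a transcript size off the fitted (or extrapolated) line."""
    return 10 ** (slope * distance_mm + intercept)


if __name__ == "__main__":
    # Hypothetical ladder readings (distance in mm, size in kb); invented values.
    ladder = [(12.0, 9.0), (18.0, 7.0), (25.0, 5.0), (34.0, 3.0),
              (42.0, 2.0), (55.0, 1.0), (65.0, 0.5)]
    slope, intercept = fit_log_size(ladder)
    # A band that migrated less than the largest marker requires extrapolation.
    print(round(estimate_size_kb(9.0, slope, intercept), 1), "kb (extrapolated)")
```

Because the silk transcripts are larger than the largest marker, the estimate lies outside the fitted range, so the sizes obtained this way are best read as approximate.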
The following probe sequences (kinase labeled) were used for different hybridizations: (At1) 5′-AAC TTT CTC TCT CTT TTA TA-3′, (Lg1) 5′-ACT TTT TCA AAG TTT TC-3′, (Nc1) 5′-CTG AGC ATT GAA TGT TAC-3′, (At2) 5′-GAG AAT GGA ACT AGT CCG-3′, (Lg2) 5′-CAT TGG GAA ATC CCG ACT G-3′, (Nc2) 5′-GAG AAC TGA TCT CCT CTG-3′, (At3, Lg3, Nc3) 5′-CYT GKC CWG CAR WAA A-3′, (At4, Lg4, Nc4) 5′-RGA CAT ATC ATC MAG TTG-3′, (At5, Nc5) 5′-TTC KGC CAT YGA KGA AGC RAA WGC-3′, (Lg5) 5′-TTC TGC CAC AGA TGA TGC GAA AGC-3′, (At6, Nc6) 5′-GCR ATN GCA TTK GTT TT-3′, (Lg6) 5′-ATN GCA TTK GTN GTN AC-3′. Each of the 12 probes was carefully designed not to cross-react with the repetitive portion of the coding regions and was expected to anneal only to the desired 5′ end of the mRNA sequence. Three low-stringency washes were performed: 10 min at 42 °C (2× SSC, 0.1% SDS), 12 min at 42 °C (2× SSC, 0.1% SDS), 5 min at 42 °C (0.1× SSC, 0.1% SDS). Total RNA extracted from cephalothoraxes, which do not contain any silk glands, served as a negative control in each hybridization experiment.
Results
Identification of the Major Ampullate Silk Genes and Proteins. Several genomic DNA inserts were selected, by oligonucleotide hybridization, for partial sequencing from each library. Silk fibroin coding sequences were identified based on guanine- and cytosine-rich transcripts and their
similarity to previously reported data. The amino acid sequence was deduced based on the predicted translational start site and the longest open reading frame. Three fibroin sequences (one from each library) contained a noncoding sequence upstream of the transcription start site, 5′ UTR, start codon, and a coding sequence representing a nonrepetitive N-terminus with downstream silk sequences. These sequences were designated as A. trifasciata major ampullate spidroin 2 (At.MaSp2), L. geometricus major ampullate spidroin 1 (Lg.MaSp1), and N. inaurata madagascariensis major ampullate spidroin 2 (Nim.MaSp2).13 Newly identified genomic sequences were classified as either coding for MaSp1 or MaSp2. This assignment was based on the following criteria. Translated sequences were assigned as MaSp1 if the GGX and (A)n were found to be the predominant amino acid motifs present in the repetitive portion of the sequence. Sequences were assigned as MaSp2 if GPGXX and (A)n were the predominant amino acid motifs. The exact classification of newly identified silk sequences can sometimes be problematic due to the evidence of more than one gene copy of MaSp26 as well as allelic variations among individuals (J. Gatesy, personal communication). The newly identified MaSp sequences shared the most similarity to previously reported protein sequences: At.MaSp2 (GenBank accession number AF350267), Nim.MaSp2 (AF350276), and Lg.MaSp1 (AF350273).6 Partial sequences encoding silk repeat and the C-terminal end were also included in this study. The newly identified genomic DNA sequences were translated into protein sequences. The At.MaSp2 genomic fragment (DQ059136) translated into 661 amino acids of silk protein that included the nonrepetitive N-terminus. The analysis of the gene architecture revealed the presence of introns. This was the only sequence of the three that had any introns present. 
Introns 2 and 3 were nearly identical in size (1792 and 1793 nucleotides, respectively) and also extremely conserved in sequence (99.5%) (see the Supporting Information, Figure 1). In addition, a PCR product (DQ059137) represented a coding region of 230 amino acids of the repetitive silk region and the nonrepetitive C-terminus (see the Supporting Information, Figure 2A). The Lg.MaSp1 genomic insert (approximately 20 kb) was partially sequenced. The coding region of the first fragment (DQ059133) translated into 417 amino acids and consisted of silk sequence and the nonrepetitive N-terminus. The second fragment (DQ059134) translated into 637 amino acids of silk and included the nonrepetitive C-terminus (see the Supporting Information, Figure 2B). The Nim.MaSp2 sequence (DQ059135) represented 2068 amino acids corresponding to the repetitive region and the nonrepetitive N-terminus (see the Supporting Information, Figure 2C). Protein-protein BLAST (blastp) searches on sequences containing C-termini showed that At.MaSp2 scored the highest on Argiope trifasciata MaSp2 (AF350267) and Lg.MaSp1 scored the highest on Latrodectus geometricus MaSp1 (AF350273)6 confirming their identities. Blastp searches for all three sequences using the N-terminal region had three top matches: flagelliform silk from Nephila inaurata madagascariensis (AF218623),9 flagelliform silk
Motriuk-Smith et al.
Figure 1. Northern blot analyses of (A) At, (B) Lg, and (C) Nc mRNAs. For each species, six independent Northern blots were performed using total mRNA extracted from major ampullate silk glands. The arrows indicate the approximate size of identified mRNA transcripts in kilobases. The oligonucleotide name used in the hybridization is listed above each lane, and each oligonucleotide sequence is described in the Materials and Methods.
from Nephila clavipes (AF027972),7 and hypothesized dragline silk protein from Araneus ventricosus (AY945306) (unpublished).
Genomic DNA Sequence Alignment. Alignment of the genomic DNA representing the 5′ end of each of the three genes is divided into two regions, upstream and downstream of the presumed transcription start site (TSS) (see the Supporting Information, Figure 3). Position +1 (adenine) was assigned as the putative TSS. Northern blotting experiments (results described below) were used to estimate the position of TSS. An adenine with pyrimidines in proximity (present in all three sequences) is characteristic of TSS described in many transcripts. The downstream region spans almost 500 nucleotides with 48% sequence identity. The first Met codon in frame for the At.MaSp2 and the Nim.MaSp2 sequence starts at position +29. The first Met codon in frame in the Lg.MaSp1 sequence is at position +157. There are multiple Met codons in frame in each sequence. ATG codons at position +250 in Lg.MaSp1, +251 in At.MaSp2, and +248 in Nim.MaSp2 are proposed to represent the first amino acid of a putative short isoform. The second alignment of approximately 700 nucleotides upstream of each putative TSS is very adenine and thymine rich and has less conservation (25% sequence identity). There were only two regions of conserved sequence elements, which appear significant due to their position with respect to the putative TSS. The longest conserved sequence TATAAAA is located between position -25 and -31. Another motif, CACG, is located between positions -43 and -46.
Putative 5′ End Fibroin Sequences Present in mRNA. Eighteen Northern blotting experiments (six for each species) were performed to test the presence of the predicted 5′ end of silk fibroin in mRNA (Figure 1). First, oligonucleotides designated as Lg1, At1, Nc1 were designed to anneal to the region spanning approximately from -31 to -3 in each sequence (see the Supporting Information, Figure 3). Northern blotting experiments detected no signal, strongly suggesting that this part of the gene, as predicted, is not present in mRNA. Second, a series of oligonucleotides Lg2-6, At2-6, Nc2-6 were designed to anneal to the mRNA sequence approximately from +4 to +370 (see the Supporting Information, Figure 3). Northern blotting experiments resulted in the following banding pattern. A. trifasciata had 11 kb (probe At2, At4, and At6) and 9.8 kb (At4, At6) bands present. A diffuse signal was identified with probe At3 (Figure 1A). L. geometricus had 11.5 kb and 10.5 kb transcripts present (probe Lg2, Lg3, Lg5, and Lg6). The 10.5 kb (Lg2) and 11.5 kb (Lg6) bands had lower signal intensities (Figure 1B). N. clavipes had 12.5 kb and 11 kb transcripts present (Nc3-5). The 11 kb (Nc4) band had lower signal intensity. In addition, low-intensity signals at 9.5 kb were identified with probes Nc3 and Nc4 (Figure 1C). Multiple sizes and varied intensities of the identified bands are most likely caused by cross hybridization of oligonucleotides to major ampullate silk protein transcripts (such as MaSp1 or MaSp2). Allelic variation and multiple gene copies are also likely explanations. Similar mRNA transcript sizes for MaSp1 and MaSp2 have been reported previously in several different spider species.5,14-16 Exceptions were probes Nc2 and Nc6, which did not generate any signal, possibly because of mismatches between Nim and Nc genomic sequences. Also, oligonucleotides Lg4 and At5 generated no signal for unknown reasons. Nonetheless, taken together, the Northern blot mapping demonstrates that the sequence from approximately +4 to +370 position is present in the mRNAs of major ampullate silk glands from each of the three species investigated.
N-Terminal Domain of the Long Isoform. The amino acid sequence representing the N-termini of each major ampullate silk protein was deduced from its respective genomic DNA sequence (Figure 2). 
The size of the N-terminus starting with the first Met varies from 156 amino acids in At.MaSp2 to 109 amino acids in Lg.MaSp1. The first Met codon in frame is considered the most common translation start site.17,18 Silk proteins starting with the first Met codon in frame will be referred to as the long isoform. Signal peptide cleavage sites were predicted in At.MaSp2 and Nc.MaSp2 sequences19 with a very high signal peptide probability (0.99). The Lg.MaSp1 protein was predicted to be a nonsecretory protein with a 0.0 signal peptide probability. It has been reported20 that the computational predictions are only 78% accurate compared to the experimental data, so it is possible that Lg.MaSp1 protein falls into the nonpredictable category. Alignment of the long isoforms of all three silks is presented in Figure 3A.
N-Terminal Domain of the Short Isoform. The presence of multiple Met codons in all three sequences suggests the possibility of more than one silk isoform. Adenine at position -3, with respect to the ATG, in mammalian17 and Drosophila21 sequences increases the likelihood that the codon serves as a translation start site. A Met codon in this “optimal context” is referred to as a strong start codon. There were numerous Met codons which could be defined as strong start codons. They include Met in positions 1, 75, and 91 in At.MaSp2; 32 and 99 in Lg.MaSp1; 1, 68, 74, 90, 109, and 145 in Nim.MaSp2.
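The "strong start codon" criterion used here (an in-frame ATG with an adenine at position -3) can be illustrated with a short scan. This is a minimal sketch, not the procedure used in the study: the function name and the toy sequence are hypothetical, and only the -3 adenine is checked, not the rest of the Kozak context.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_start_codons(mrna, first_atg_index):
    """List in-frame ATG codons and whether each has an adenine at -3.

    `first_atg_index` is the 0-based index of the A of the first ATG.
    Returns (codon_number, is_strong) pairs, where codon_number counts
    codons from the first ATG (the first ATG is codon 1). Scanning stops
    at the first in-frame stop codon.
    """
    seq = mrna.upper()
    hits = []
    for i in range(first_atg_index, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon == "ATG":
            codon_number = (i - first_atg_index) // 3 + 1
            # Position -3 is three bases upstream of the A of the ATG (+1).
            is_strong = i >= 3 and seq[i - 3] == "A"
            hits.append((codon_number, is_strong))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any silk gene.
    toy = "GCCACCATGGCTAAAATGTTTCCCATGTAA"
    print(in_frame_start_codons(toy, toy.find("ATG")))
```

In the toy sequence, the Met codons at positions 1 and 4 have an adenine at -3, while the one at position 7 does not.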
However, there is only one strong start codon located in a conserved position in all 3 species, namely, Met in position 75 in At.MaSp2, 32 in Lg.MaSp1, and 74 in Nim.MaSp2. The N-terminus starting at this position will be referred to as a short isoform. These proposed starts of the N-termini are similar to the N-terminal start derived from flagelliform cDNA.7 The signal peptide probability in At.MaSp2 (0.79) and Nim.MaSp2 (0.51) was lower than that of the long isoform. Lg.MaSp1 was again predicted as a nonsecretory protein; however, the signal peptide probability increased to 0.41. The alignment of the N-termini of the short isoforms is presented in Figure 3B and includes 3 fibroin types and 4 spider species. Discussion In the few characterized spider fibroin cDNA and genomic DNA sequences, the spider fibroin N-terminal region, just like the C-terminal region, is a short, nonrepetitive protein domain. Recent findings suggest that the C-termini regions are involved in silk processing22 and are present in the spinning dope as well as in the silk fibers.23 The N-terminal sequence alignment (short isoform) presented in Figure 3B among three different silk orthologs MaSp1, MaSp2, Flag shows 33% identity, while the alignment of the corresponding C-termini shows only 18%. The N-terminal regions of the short isoforms are the most conserved part of the silk proteins analyzed. Pairwise comparison of N-termini showed that the most identity is shared between two groups of orthologous fibroins, namely Flag (Nc.Flag and Nim.Flag: 100% identity) and MaSp2 (At.MaSp2 and Nim.MaSp2: 57% identity). Phylogenetic trees constructed using the C-terminal sequences also showed a cluster of Flag homologues and a cluster of major ampullate homologues.24,25 Similarly, it is envisioned that as more N-termini sequence data become available, the evolutionary relationships among silk fibroins will be better understood. 
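As an illustration of how an identity percentage such as those quoted above can be computed from an alignment, the sketch below counts columns in which all aligned residues are identical. The convention of skipping any column that contains a gap is an assumption made for this example; the text does not state which convention was used, and the toy sequences are hypothetical.

```python
def percent_identity(aligned):
    """Percent of gap-free alignment columns in which all residues match.

    `aligned` is a list of equal-length strings with '-' marking gaps.
    Works for pairwise (two sequences) or multiple alignments.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for column in zip(*aligned):
        if "-" in column:
            continue  # assumed convention: ignore columns with gaps
        compared += 1
        if len(set(column)) == 1:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical toy alignment, not real silk N-termini.
    toy = ["MSWT-A", "MSFTQA", "MAWTQA"]
    print(percent_identity(toy))       # identity across all three sequences
    print(percent_identity(toy[:2]))   # pairwise identity of the first two
```

The same function covers both the multi-sequence comparison and the pairwise comparisons, since a pairwise alignment is just the two-sequence case.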
The identified N-termini are very distinct parts of each silk protein due to their unique amino acid sequences. Amino acids that are rarely found in the repetitive part of the silk sequence, such as M, F, T, N, E, and L, are present in the N-terminus. Hydrophobic amino acids such as V, L, and I comprise almost 20% of this region. The same subset of hydrophobic amino acids is also found in C-terminal part of silk. A and G are not found as frequently (A, 17%, and G, 5%) in the N-terminus as in their respective silk repeats (A, 27%, and G, 36%). Furthermore, in the N-terminus, these two amino acids are not organized into the typical amino acid motifs found in the repetitive regions. Chou-Fasman structural analysis26 predicts a helix as the predominant secondary structure of the N-terminus (Table 1). A leader peptide in helical conformation is in agreement with the helical hairpin hypothesis.27 This hypothesis states that the leader peptide in helical conformation is followed by a hairpin turn and then by another helical region of the protein itself. The structural prediction of the N-terminal region in all five sequences analyzed suggests the possibility of the helix-turn-helix conformation. However, the size of the helix and turn structure is not consistent among the
Figure 2. Deduced translation of N-terminal sequences: (A) At.MaSp2, (B) Lg.MaSp1, and (C) Nim.MaSp2. DNA sequences start with the predicted transcription start site (+1) and end with the N-terminal coding region (further downstream are the silk repetitive sequences; see the Supporting Information, Figure 2). Met residues representing the first amino acid of the long or short isoform are shaded. Caret symbols (the predicted signal peptide cleavage site) and the first amino acid of the mature protein (long or short isoform) are boxed and shaded. Light gray represents the long isoform, dark gray is the short isoform. Every Met in frame is shown in bold. Every adenine at the -3 position (with respect to the potential start codon) is in lower case and bold.
Figure 3. N-terminal sequence alignments: (A) long isoform and (B) short isoform. The protein sequences start with the first Met in frame. Protein sequences in (A) reported in this study are derived from genomic DNA and correspond to the following GenBank accession numbers: Lg.MaSp1 (DQ059133), At.MaSp2 (DQ059136), and Nim.MaSp2 (DQ059135). Met residues downstream of the first, which are the first amino acid of the short isoform, are bold. Short isoform alignment (B) represents the five known N-terminal sequences. Two additional sequences were used here: Nc.Flag, flagelliform silk from N. clavipes cDNA (AF027972), and Nim.Flag, flagelliform silk from N. inaurata madagascariensis genomic DNA (AF218623). In each aligned position, identical amino acids have a black background, and chemically similar amino acids are shaded in gray.

Table 1. Summary of the Predicted Helical, Turn, and Hydrophobic Regions in All N-Terminal Sequences Analyzed(a)

N-terminus sequence source (short isoform)   helix (amino acid range)   turn (amino acid range)      hydrophobic region (amino acid range)
Nc.Flag                                      5-33; 59-79                2-5; 30-35; 53-60            18-24; 47-48; 67-69
Nim.Flag                                     5-33; 59-79                2-5; 30-35; 53-60            18-24; 47-48; 67-69
At.MaSp2                                     1-56                       5-11; 30-33; 54-57; 77-80    14-24; 46-69
Lg.MaSp1                                     7-29; 37-48; 67-75         6-9; 29-32; 35-38; 73-76     12-20; 43-70
Nim.MaSp2                                    5-32; 45-32; 64-79         3-11; 30-34; 54-60           14-22; 47-55; 65-74

(a) Five short isoform N-terminal sequences (Figure 3B) were included in this analysis.
different N-termini. The only consistent structural predictions in all of the analyzed sequences are helical regions ranging from amino acid position 7 to position 29. The Sweet-Eisenberg hydrophobic domains28 for all five sequences fall within the above range of amino acid positions (Table 1). In addition, the presence of a putative leader peptide is in agreement with silk gland histological observations. A silk gland (serving as silk protein storage) is lined with secretory, columnar epithelial cells. The process of protein synthesis and secretion occurs in these specialized cells.29-31 A leader peptide serves as a signal to transport the protein into the endoplasmic reticulum, and via the Golgi apparatus, the protein is secreted out of the cell. Golgi vesicles inside the epithelial secretory cells were shown to increase in number and size during silk protein synthesis,32 suggesting their direct involvement in this process.
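The shared helical region quoted above (positions 7 to 29) follows directly from Table 1 as the overlap of the first predicted helix in each sequence. The sketch below reproduces that arithmetic; the helper names are illustrative, and only the first helix and the first hydrophobic region listed for each sequence in Table 1 are used.

```python
# First predicted helix and first hydrophobic region per sequence (Table 1).
FIRST_HELIX = {
    "Nc.Flag": (5, 33),
    "Nim.Flag": (5, 33),
    "At.MaSp2": (1, 56),
    "Lg.MaSp1": (7, 29),
    "Nim.MaSp2": (5, 32),
}
FIRST_HYDROPHOBIC = {
    "Nc.Flag": (18, 24),
    "Nim.Flag": (18, 24),
    "At.MaSp2": (14, 24),
    "Lg.MaSp1": (12, 20),
    "Nim.MaSp2": (14, 22),
}


def common_range(ranges):
    """Return the (start, end) overlap shared by all ranges, or None."""
    ranges = list(ranges)
    start = max(s for s, _ in ranges)
    end = min(e for _, e in ranges)
    return (start, end) if start <= end else None


def all_within(ranges, outer):
    """True if every (start, end) range lies inside the `outer` range."""
    return all(outer[0] <= s and e <= outer[1] for s, e in ranges)


if __name__ == "__main__":
    shared = common_range(FIRST_HELIX.values())
    print(shared)  # (7, 29)
    print(all_within(FIRST_HYDROPHOBIC.values(), shared))  # True
```

Note that this check covers only the first hydrophobic region of each sequence; the later regions listed in Table 1 extend beyond position 29.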
There are multiple Met codons found in the N-termini of each silk protein. This amino acid has been found infrequently in the repetitive or C-terminal silk protein sequences. Selection for high expression levels of silk genes may explain this frequent Met codon occurrence. Dragline silk, essential for spider survival, has to be generated in large quantities, which is tightly linked to silk protein synthesis. The presence of additional Met codons downstream of the first one may create additional translation start sites. In the event of mutation of the first Met codon or leaky scanning past the first Met, silk protein translation can be initiated downstream. Leaky scanning, where the first and second ATG codons are in the same reading frame, is well documented, for example, in the microtubule-associated protein gene in humans.33 Synthesis of more than one silk protein isoform might be another explanation for the presence of multiple
Met codons. We suggest the possibility of long and short protein isoform synthesis; however, due to the large size of silk proteins, the isoforms would not be distinguishable on any gel system. The short isoform seems to be a more logical possibility mainly due to the positionally conserved optimal context Met codon and N-terminal size similarity to the flagelliform silk. Synthesis of long and short isoforms has been shown in Ovo protein in Drosophila.34 The size of the N-terminal region may not be important as long as the mature protein is properly secreted, folded, and functional. An example of two protein isoforms with the same function is the β-1,4-galactosyltransferase protein found in mice.35 The DNA alignment of sequences upstream of the TSS does not show a high level of identity. The sequences involved in transcriptional regulation are diversified compared to the coding region. There are two motifs found in all three sequences analyzed, the TATA box and the CACG motif. The identified TATA box motif fits the classical example in both its sequence and location with respect to the predicted TSS. The same conserved motif is also present in the flagelliform silk fibroin (AF218623). The TATA box was also found in lepidopteran silk sequences36-38 suggesting that this motif is an essential component of even convergently evolved silk sequences. Toxin sequences identified in arachnid venoms were also reported to contain a TATA box element in the upstream regulatory regions.39,40 The CACG motif does not resemble any well-characterized promoter sequence element. However, its location at -40 suggests that it might be a part of a promoter. To date, there has been no report on any spider silk promoter; therefore, we cannot speculate about the functionality of the identified motifs. Alignment of the DNA sequences starting downstream of the TSS has a much higher level of conservation. 
This sequence similarity has been maintained for 125 million years, since the divergence of Araneidae (Argiope) and derived araneoids (Nephila and Latrodectus).41 This protein coding region is likely of fundamental importance. In summary, this study contributes to the current knowledge of major ampullate spider silk genes and proteins. Major ampullate silk proteins from different species have a similar N-terminal sequence. This relatively short domain differs in amino acid sequence and predicted structure from the repetitive part of the silk protein. The presence of multiple Met codons suggests the possibility of translation of both a long and a short silk isoform. In addition, predicting the presence of the leader peptide sequence confirms that silk proteins belong to the group of secretory proteins. Function of the mature N-terminal region remains unknown. The genomic DNA sequences allow investigating the molecular architecture of MaSp genes. The At.MaSp2 is the only major ampullate fibroin found to contain introns. The presence of introns in Lg.MaSp1 and Nim.MaSp2 sequences cannot be ruled out until entire gene sequences are identified. Furthermore, comparisons of the upstream DNA sequences show two important features. First, the coding DNA region maintains a high level of sequence identity among the three genes analyzed. Second, comparison of sequences upstream of the predicted TSS identified two conserved sequence motifs that may represent promoter elements.
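To make the upstream coordinate convention concrete, the sketch below locates a motif in a sequence lying immediately upstream of the TSS and reports its position relative to the TSS, with the base just before +1 numbered -1. The function name is hypothetical, and the toy sequence was constructed only so that TATAAAA and CACG fall at the positions reported in the text; it is not a real silk gene sequence.

```python
def motif_positions(upstream, motif):
    """Find all occurrences of `motif` in `upstream`, in TSS-relative coordinates.

    `upstream` is the sequence immediately 5' of the TSS, so its last base
    is position -1. Returns a list of (first_base, last_base) positions.
    """
    seq = upstream.upper()
    motif = motif.upper()
    hits = []
    start = seq.find(motif)
    while start != -1:
        first = start - len(seq)          # index len(seq)-1 maps to -1
        last = first + len(motif) - 1
        hits.append((first, last))
        start = seq.find(motif, start + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical 50-nt toy upstream region built to place the two motifs.
    toy_upstream = "GGGG" + "CACG" + "G" * 11 + "TATAAAA" + "G" * 24
    print(motif_positions(toy_upstream, "TATAAAA"))  # [(-31, -25)]
    print(motif_positions(toy_upstream, "CACG"))     # [(-46, -43)]
```

Applied to each of the three aligned upstream sequences, the same lookup would show whether a motif sits at a conserved distance from the putative TSS.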
Acknowledgment. Funding for this research was provided by the National Institutes of Health and the National Science Foundation. We thank F. Teule for help with the experimental design and J. Gatesy for help with the data collection. We also thank B. W. Smith, M. Hinman, and anonymous reviewers for their comments on the manuscript.
Supporting Information Available. Molecular architecture of the At.MaSp2 gene (Figure 1); repetitive nature of silk protein sequences with the nonrepetitive C-termini and N-termini (Figure 2); alignment of identified genomic DNA sequences that include the noncoding and protein coding regions (Figure 3). This material is available free of charge via the Internet at http://pubs.acs.org.
References and Notes
(1) Candelas, G.; Candelas, T.; Ortiz, A.; Rodriguez, O. Biochem. Biophys. Res. Commun. 1983, 116, 1033-8.
(2) Sponner, A.; Schlott, B.; Vollrath, F.; Unger, E.; Grosse, F.; Weisshart, K. Biochemistry 2005, 44, 4727-36.
(3) Hinman, M. B.; Lewis, R. V. J. Biol. Chem. 1992, 267, 19320-4.
(4) Beckwitt, R.; Arcidiacono, S. J. Biol. Chem. 1994, 269, 6661-3.
(5) Guerette, P. A.; Ginzinger, D. G.; Weber, B. H.; Gosline, J. M. Science 1996, 272, 112-5.
(6) Gatesy, J.; Hayashi, C. Y.; Motriuk, D.; Woods, J.; Lewis, R. Science 2001, 291, 2603-5.
(7) Hayashi, C. Y.; Lewis, R. V. J. Mol. Biol. 1998, 275, 773-84.
(8) Mello, C. M.; Senecal, K.; Yeung, B.; Vouros, P.; Kaplan, D. In Silk Polymers; Kaplan, D., Adams, W. W., Farmer, B., Viney, C., Eds.; American Chemical Society: Washington, DC, 1994; pp 67-79.
(9) Hayashi, C. Y.; Lewis, R. V. Science 2000, 287, 1477-9.
(10) Sambrook, J.; Russel, D. W. Molecular Cloning: A Laboratory Manual, 3rd ed.; Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY, 2001; pp 6.4-6.11.
(11) Feinberg, A. P.; Vogelstein, B. Anal. Biochem. 1983, 132, 6-13.
(12) Candelas, G. C.; Cintron, J. J. Exp. Zool. 1981, 216, 1-6.
(13) Gene designations as MaSp1 or MaSp2 based on partial sequences are tentative. The absence of full-length cDNAs or genes makes it impossible to be certain of connection between a 5′ end gene fragment and a 3′ end gene fragment. Because of their similar repetitive regions, At.MaSp2 genomic sequences (DQ059136-7) and cDNA At.MaSp2 sequence (AF350267) are thought to represent the same gene. Lg.MaSp1 genomic sequences (DQ059133-4) differ from the repeats found in cDNA (AF350273), and therefore, these sequences are reported in GenBank as Lg.MaSp1-like. Nim.MaSp2 (DQ059135) includes the sequence classified as Nim.MaSp2-like (AF350276). The repeat sequences of DQ059135/AF350276 differ from the repeats of the gene fragment, Nim.MaSp2 (AF350278). Hence, DQ059135, like AF350276, is designated as MaSp2-like in the GenBank entry.
(14) Xu, M.; Lewis, R. V. Proc. Natl. Acad. Sci. U.S.A. 1990, 87, 7120-4.
(15) Hayashi, C. Y.; Shipley, N. H.; Lewis, R. V. Int. J. Biol. Macromol. 1999, 24, 271-5.
(16) Pouchkina-Stantcheva, N. N.; McQueen-Mason, S. J. Comp. Biochem. Physiol., B 2004, 138, 371-6.
(17) Kozak, M. Nucleic Acids Res. 1987, 15, 8125-48.
(18) Kozak, M. J. Cell Biol. 1989, 108, 229-41.
(19) Nielsen, H.; Engelbrecht, J.; Brunak, S.; von Heijne, G. Protein Eng. 1997, 10, 1-6.
(20) Zhang, Z.; Henzel, W. J. Protein Sci. 2004, 13, 2819-24.
(21) Cavener, D. R. Nucleic Acids Res. 1987, 15, 1353-61.
(22) Jin, H.; Kaplan, D. L. Nature 2003, 424, 1057-61.
(23) Sponner, A.; Unger, E.; Grosse, F.; Weisshart, K. Biomacromolecules 2004, 5, 840-845.
(24) Hayashi, C. Y. In Molecular Systematics and Evolution; DeSalle, R., Girbet, G., Wheeler, W., Eds.; Birkhauser Verlag: Switzerland, 2002; pp 209-223.
(25) Hayashi, C. Y.; Blackledge, T. A.; Lewis, R. V. Mol. Biol. Evol. 2004, 21, 1950-9.
(26) Chou, P. Y.; Fasman, G. D. Biochemistry 1974, 13, 211-22.
(27) Engelman, D. M.; Steitz, T. A. Cell 1981, 23, 411-22.
(28) Sweet, R. M.; Eisenberg, D. J. Mol. Biol. 1983, 171, 479-88.
(29) Bell, A. L.; Peakall, D. B. J. Cell Biol. 1969, 42, 284-95.
(30) Kovoor, J. Ann. Sci. Nat., Zool. 1972, 14, 1-40.
(31) Vollrath, F.; Knight, D. P. Nature 2001, 410, 541-8.
(32) Plazaola, A.; Candelas, G. C. Tissue Cell 1991, 23, 277-84.
(33) Su, L. K.; Qi, Y. Genomics 2001, 71, 142-9.
(34) Mével-Ninio, M.; Fouilloux, E.; Guénal, I.; Vincent, A. Development 1996, 122, 4131-8.
(35) Harduin-Lepers, A.; Shaper, J. H.; Shaper, N. L. J. Biol. Chem. 1993, 268, 14348-59.
(36) Tsujimoto, Y.; Suzuki, Y. Cell 1979, 18, 591-600.
(37) Sezutsu, H.; Yukuhiro, K. J. Mol. Evol. 2000, 51, 329-38.
(38) Žurovec, M.; Sehnal, F. J. Biol. Chem. 2002, 277, 22639-47.
(39) Krapcho, K. J.; Kral, R. M.; Vanwagenen, B. C.; Eppler, K. G.; Morgan, T. K. Insect Biochem. Mol. Biol. 1995, 25, 991-1000.
(40) Delabre, M. L.; Pasero, P.; Marilley, M.; Bougis, P. E. Biochemistry 1995, 34, 6729-36.
(41) Selden, P. A. Paleontology 1990, 33, 257-85.
BM050472B