GlycoSiteAlign: Glycosite Alignment Based on Glycan Structure


Technical Note

GlycoSiteAlign: glycosite alignment based on glycan structure Alessandra Gastaldello, Davide Alocci, Jean-Luc Baeriswyl, Julien Mariethoz, and Frédérique Lisacek J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00481 • Publication Date (Web): 15 Aug 2016 Downloaded from http://pubs.acs.org on August 16, 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

GlycoSiteAlign: glycosite alignment based on glycan structure

Alessandra Gastaldello,∗,†,‡ Davide Alocci,†,‡ Jean-Luc Baeriswyl,¶,† Julien Mariethoz,†,‡ and Frédérique Lisacek∗,†,‡,¶

†Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, 7 route de Drize, 1227 Geneva, Switzerland
‡Computer Science Department CUI, University of Geneva, 1227 Geneva, Switzerland
¶Section of Biology, Faculty of Sciences, University of Geneva, 1211 Geneva, Switzerland

E-mail: [email protected]; [email protected]
Phone: +41 (0)22 379 01 59; +41 (0)22 379 58 98

Abstract

GlycoSiteAlign is a tool designed to align amino acid sequences of variable length surrounding glycosylation sites depending on the knowledge of glycan structure. It is an exploratory resource intended for the identification of characteristic amino acid patterns of unique glycan-protein interactions. GlycoSiteAlign uses data from the UniCarbKB and UniProtKB databases and it is hosted on ExPASy, the Swiss Institute of Bioinformatics resource portal. The user can select either specific or general glycan features, set the length of the protein fragments and trigger an alignment with the option of including 90% homologous proteins. The tool previews and/or downloads alignments which may reveal amino acid patterns corresponding to selected features (e.g. “fucosylated” vs “non-fucosylated”). GlycoSiteAlign will integrate new data as they become available to confirm and expand results. It is presented as a promising tool to assess and refine the knowledge about the constraints that link a particular glycan structure to a particular glycosite and in the long term this application could help improve prediction tools.

Keywords glycoprotein, sequence alignment, glycosylation, glycan structure, glycosite, amino acid patterns

INTRODUCTION

Protein glycosylation is the enzyme-catalyzed covalent attachment of a glycan to a polypeptide, resulting in the formation of a glycoprotein. It is one of the most common co-translational/post-translational modifications and takes place in the endoplasmic reticulum (ER) and Golgi apparatus. There are three major types of protein glycosylation: N-, O- and C-glycosylation. In N-glycosylation, glycans are attached to a nitrogen atom of an asparagine (N) residue in the following protein consensus sequence: N-X-S/T, where S is a serine, T a threonine and X any amino acid except proline (P). In O-glycosylation, glycans are instead linked to an oxygen atom of an S or a T residue and, rarely, to a hydroxyproline or a tyrosine. 1,2 No particular consensus sequence has been detected for O-glycosylation, although it usually occurs in a portion of sequence rich in hydroxy amino acids and, as shown by sequence alignment studies, proline residues frequently occur around the glycosylation sites, especially at amino acid positions -1 and +3 (by convention, position zero corresponds to the glycosylated residue). 3,4 In contrast, the presence of charged amino acids at these sites tends to prevent the binding of glycans. 5,6 N-linked and O-linked glycosylation are the most abundant types of glycosylation in mammals. 7 In C-glycosylation, also called C-mannosylation, glycans are attached to a carbon atom of a tryptophan (W) residue, and the W-X-X-W motif was identified as the acceptor consensus sequence for glycan binding. Few human proteins, such as thrombospondin-1 (TSP-1), ribonuclease-2 and the beta subunit of interleukin-12, display this type of glycosylation. 8–10

Notable advances have been made in the comprehension of glycan function and the characterisation of glycan structure, but some fundamental aspects of glycan biology remain poorly understood. Among these are the mechanisms of glycosylation and the correlation between the peptide sequences around glycosylation sites and the glycan structures. This unknown correlation challenges the prediction of glycosylation sites on proteins, which is also limited by the scarcity of data. Indeed, the experimental determination of glycosylation sites is difficult to achieve, as a large amount of purified protein is needed and glycosylation has been shown to be an organism- and tissue-specific process. 11 In recent years, remarkable effort has been made to boost both the discovery and mapping of O- and N-glycosylation sites. For example, a genetic engineering approach using human cell lines, developed by Clausen and colleagues, 12 has enabled proteome-wide discovery of GalNAc-type O-glycosylation sites. As described in this work, many previously unknown sites were identified. The systematic mapping of N-sites was undertaken earlier and gave rise, for example, to UniPep, a database of human N-linked glycosites found on proteins isolated from plasma, cerebrospinal fluid, various tissues and cell sources. This database can be searched with different parameters and displays information such as the subcellular localization of the protein, the predicted N-linked glycosites and their location, the mass spectrometrically identified glycopeptides and their sequence, along with relevant annotations and a full protein topology. 13 Currently, the database contains 1522 unique N-linked glycosites.
Despite the clear value of these datasets and the development of new technology such as SimpleCell, the experimental determination of glycosylation sites only partially describes the overall process, which explains the low level of full glycoprotein characterisation and annotation. To compensate for the lack of data, in-silico prediction has been and continues to be developed. Tools such as NetOGlyc, 14–16 NetNGlyc, 17,18 GlycoEP 19 and GlycoMine 20 rely on machine-learning techniques (neural networks and support vector machines) and are trained to recognize the protein sequence context, the secondary structure and the accessibility of the glycosylation sites. Note that NetOGlyc and NetNGlyc have been available since the 1990s, and over the years the NetOGlyc server has considerably improved its sensitivity, in particular following the massive information input provided by Clausen and colleagues. 12 More recently, the GlycoEP server was developed to improve prediction models for N-, O- and C-linked glycosites in eukaryotic protein sequences. 19 The method used datasets of experimentally verified eukaryotic glycoproteins extracted from the SWISS-PROT June 2011 release, in which redundancy was reduced at the level of protein sequence and of glycosite patterns. The GlycoMine project also proposes a novel bioinformatics approach to improve the prediction performance for all three major types of glycosylation in human. 20 The method uses intensive feature selection techniques to choose and integrate several informative features (e.g. sequence-based, structural and functional features) to predict the glycosylation sites in a protein of interest. The authors stated that this approach allows GlycoMine to outperform other tools, including NetNGlyc and NetOGlyc. The improvement brought by the tools described so far, mostly in terms of prediction accuracy, is important, but to the best of our knowledge these methods fail to take into consideration a factor that is likely to contribute largely to constraining the peptide features of glycosylated sites and the outcome of the glycosylation process: the structure of the attached glycans. This oversight unfortunately reflects the scarcity of fully characterized glycoprotein structures. However, the release of new data in UniCarbKB, 21 which upgraded GlycoSuiteDB, 22 now enables investigation of the relationship between features of glycosylation sites and glycan structures.
Preliminary studies were already performed in order to correlate, for example, the N-glycan type with the accessibility of glycosylation sites on glycoproteins. 23,24 More recently, a quantitative and qualitative analysis of the glycoprofile of 474 N-linked glycosites on 169 mammalian N-glycoproteins established a correlation between N-glycan structural motifs and the structure of the carrier glycosylation sites. This in-silico study highlights the contribution of solvent accessibility and physicochemical properties of protein chains surrounding the glycosites. 25 Following a similar path, the web application presented in this paper, named GlycoSiteAlign, has been developed to contribute to the assessment of this relationship while accounting for a variety of glycan properties and the amino acid sequence found in the glycosite vicinity. It is built upon the assumption that glycoproteomics will soon deliver new datasets, but as these new data are still often limited to glycoprotein sites associated with glycan composition, we designed GlycoSiteAlign to be queried with such fuzzy data. The tool is intended to help investigate the link between glycan structures and attachment sites prior to revising prediction methods. GlycoSiteAlign groups and aligns amino acid regions of glycoproteins that surround glycosylation sites carrying glycans with selected structural features. In this way, particular amino acid patterns/motifs (in addition to the well-known N-X-S/T for N-glycosylation and W-X-X-W for C-mannosylation) can be more easily identified. GlycoSiteAlign does not predict glycosylation sites. It was designed as an exploratory tool bringing out the expected specificity that constrains the attachment of a particular glycan structure at a particular glycoprotein site. In the short term, the application will help investigate the mutual requirements conditioning glycan attachment to a glycoprotein. In the long run, observed regularities could be used to improve existing prediction tools or to develop a new one.

GlycoSiteAlign is hosted on ExPASy, the bioinformatics resource portal of the SIB Swiss Institute of Bioinformatics, under the section dedicated to glycomics (http://www.expasy.org/glycomics). In GlycoSiteAlign the user can choose one or more glycan features, and the application selects the group of proteins linked to the glycans with those features. The application isolates fragments of adjustable length (by default 20 amino acids) on each side of each glycosylated residue. These fragments are then aligned. The resulting alignments are returned as text files and svg images and are also made available for download. These outputs can be investigated to find recurring amino acid patterns. The application will integrate new data as they are produced, thereby supporting the gradual refinement of our understanding of glycosylation.


EXPERIMENTAL PROCEDURES

Mode of operation of GlycoSiteAlign

Data collection

GlycoSiteAlign takes glycan structures as input and outputs aligned amino acid sequences surrounding glycosylation sites. To accomplish this, GlycoSiteAlign needs three types of data: glycan features, the positions of glycosylation sites on glycoproteins, and glycoprotein sequences (see diagram in Figure 1). The glycan features are retrieved from UniCarbKB (www.unicarbkb.org, release of 18th November 2015) and are available to the user as filtering options. They include: UniCarbKB identifier (Id), monosaccharide composition, glycan determinants, mass, several structural properties and core types. The full list is provided in Table 1. The glycosylation sites are also retrieved from UniCarbKB; they are represented by the glycosylated amino acid(s), the UniProt IDs of the proteins in which they occur and the position in the corresponding sequences (Figure 1). The full sequences of glycoproteins are extracted from UniProtKB (www.uniprot.org, release 2015-11) and only portions surrounding glycosylation sites are selected (Figure 1).

The data are organized in dictionaries. A set of dictionaries associates a list of UniCarbKB glycan IDs with each glycan feature, and a single dictionary associates these IDs with glycosylation sites and UniProtKB protein IDs. Another dictionary relates the UniProtKB protein IDs to the glycoprotein sequences.
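The chain of lookups described above can be illustrated with a minimal Python 3 sketch. It is not the GlycoSiteAlign source code: the dictionary names, the function `sites_for_feature`, the identifiers (`G1`, `P_A`, ...) and the short sequences are all hypothetical placeholders chosen for illustration, not UniCarbKB or UniProtKB records.

```python
# Hypothetical toy data: identifiers and sequences are placeholders,
# not real UniCarbKB or UniProtKB entries.
feature_to_glycans = {
    "Fucosylated": {"G1", "G2"},
    "Sialylated": {"G2", "G3"},
}
# glycan ID -> list of (protein ID, 1-based position of the glycosylated residue)
glycan_to_sites = {
    "G1": [("P_A", 5)],
    "G2": [("P_A", 5), ("P_B", 3)],
    "G3": [("P_B", 8)],
}
# protein ID -> full sequence
protein_to_sequence = {
    "P_A": "MKTANVSLLG",
    "P_B": "GSNGTCPSAT",
}


def sites_for_feature(feature):
    """Return sorted, unique (protein ID, position, residue) triples for a feature."""
    found = set()
    for glycan_id in feature_to_glycans.get(feature, ()):
        for protein_id, position in glycan_to_sites.get(glycan_id, ()):
            residue = protein_to_sequence[protein_id][position - 1]
            found.add((protein_id, position, residue))
    return sorted(found)


print(sites_for_feature("Sialylated"))
```

Using a set to collect the sites means that a site carrying several glycans with the same feature is reported only once; whether the tool deduplicates in this way is an assumption of the sketch.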


Figure 1: Diagram of the mode of operation of GlycoSiteAlign. Information about glycan features, glycosylated proteins and sites is retrieved from UniCarbKB, while glycoprotein sequences are retrieved from UniProtKB. For each feature, sequence fragments around glycosylation sites are selected and aligned; 90% homologous proteins can be added to the alignment.

Selection of protein sequence fragments

Once the user has specified the desired glycan features and the fragment length n (by default 20), GlycoSiteAlign retrieves the corresponding UniCarbKB IDs, then the UniProtKB IDs, the glycosylation sites and the glycoprotein sequences to which the glycans are attached (see diagram in Figure 1). From the sequences, GlycoSiteAlign selects a portion of n amino acids on each side of each glycosylation site, creating the fragments to be aligned (Figure 1). The number of retrieved glycan IDs and, as a consequence, of the final glycoprotein fragments, depends on the type of the specified glycan features. Due to currently limited knowledge of glycan structures at specific sites, the more specific the feature, the fewer the glycosylation sites and the glycoprotein fragments. In particular, the UniCarbKB ID and Determinants options are the most specific features (Refined search in Table 1). Glycan composition and Mass range are two features allowing the user to be specific yet still flexible (Flexible search in Table 1). In contrast, Glycan structural properties and Core structures are considered as general glycan features (Coarse search in Table 1). The granularity of the search need not be set. Mixed criteria can also be established in the advanced search page, where glycan structures can be selected on several simultaneous features, for example an N-linked high-mannose glycan that is both fucosylated and sialylated. Furthermore, to increase the number of protein fragments and consequently the quality of the resulting alignments, GlycoSiteAlign accepts experimentally determined glycan composition(s) and the corresponding glycosylation sites as direct input from the user.

Table 1: Search categories and glycan features found in each category.

SEARCH TYPE                     FEATURES

Refined Search
  UniCarbKB Id                  e.g. 4, 25, 96, 482, 541, 659, 705, 1031, 1119, 1231, 1315, 2859, 77344
  Determinants                  A (Type 2), Lewis x (SSEA-1), O-Mannose Lewis x, I antigen,
                                Di-sialyl T antigen, T (Tf antigen), Sda/CT antigen, a-Gal antigen,
                                Sialyl Tn antigen, Sialyl T antigen, HNK-1 antigen, Type 2 LN2,
                                Polysialic acid, Type1 LN, 3-Sialyl-LN (type 2), Sialylated LDN,
                                HNK-1 on O-mannose, O-Linked mannose, LacdiNAc (LDN),
                                Type 2 LN (N-LacNAc), Fucosylated LDN, 2,6-Branched O-mannose,
                                4’-sulfated LDN, GM4

Flexible Search
  Composition                   Hexose, HexNAc, DeoxyHexose, NeuAc, NeuGc, Pentose, Sulfate,
                                Phosphate, HexA, Methyl, Acetyl, Other
  Mass range                    From 164 to 4548 Dalton

Coarse Search
  Glycan structural properties  Fucosylated, Non-fucosylated, Core-fucosylated, Non-core fucosylated,
                                Bisecting N-acetylglucosamine core-fucosylated,
                                Bisecting N-acetylglucosamine non-fucosylated,
                                Sialylated, Non-sialylated,
                                Bisecting N-acetylglucosamine sialylated,
                                Bisecting N-acetylglucosamine non-sialylated,
                                Bisecting N-acetylglucosamine
  Core structures               N-linked: Hybrid, Complex, High mannose, Truncated, Xylose,
                                Hybrid-xylose, Complex-xylose, High mannose-xylose,
                                Truncated-xylose, Not-available
                                O-linked: 0 (Tn antigen), 1 (T antigen), 2, Not-available
                                N/O-linked: core not-available
                                O/C-linked: core not-available

Advanced Search
  Combine glycan features       All the above listed features
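One way to picture the advanced search is as an intersection of the glycan ID sets attached to each selected feature. The Python 3 sketch below makes that assumption explicit; the function `combine_features`, the `index` dictionary and the identifiers `G1` to `G5` are hypothetical, and the source does not state how GlycoSiteAlign combines features internally.

```python
def combine_features(feature_to_glycans, required):
    """Return the glycan IDs that carry every feature listed in `required`.

    Assumption of this sketch: combined criteria act as a logical AND,
    implemented as a set intersection.
    """
    required = list(required)
    if not required:
        return set()
    selected = set(feature_to_glycans.get(required[0], ()))
    for feature in required[1:]:
        selected &= set(feature_to_glycans.get(feature, ()))
    return selected


# Hypothetical index: feature name -> set of glycan IDs (placeholders).
index = {
    "High mannose": {"G1", "G2", "G4"},
    "Fucosylated": {"G2", "G3", "G4"},
    "Sialylated": {"G2", "G5"},
}

# The example from the text: high-mannose, fucosylated and sialylated.
print(combine_features(index, ["High mannose", "Fucosylated", "Sialylated"]))
```

With this reading, each additional feature can only shrink the selection, which is consistent with the remark that more specific queries yield fewer glycosylation sites and fragments.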

Homologous protein sequence retrieval

GlycoSiteAlign proceeds with the alignment only if at least 2 glycoprotein fragments are selected. In other cases, to compensate for the lack of data, GlycoSiteAlign offers the user the option to include 90% homologous proteins in the alignment. Indeed, UniProt entries include a field called “similar proteins” that covers the information on homology. The UniProt Reference Clusters 90 (UniRef90) is the result of clustering protein sequences using the CD-HIT algorithm 26 such that each cluster is composed of sequences that have at least 90% sequence homology. UniRef90 is included in the UniProt “similar proteins” section and is used by GlycoSiteAlign to select 90% homologous proteins. Starting from the IDs of the available glycoproteins (from now on called reference proteins), GlycoSiteAlign retrieves the IDs, along with the sequences, of the 90% homologous proteins from UniProtKB and aligns them with the sequences of the reference proteins using a third-party client for Python 2.7 provided by the European Bioinformatics Institute. 27 This service uses Clustal Omega 28 to perform multiple sequence alignment. This step is necessary in order to match the position of each glycosylated site on the reference proteins with the positions of the hypothetical glycosylated sites on the homologous proteins. If a site position is conserved, GlycoSiteAlign selects amino acid fragments surrounding the site on both the reference protein and the similar protein; otherwise, the non-conserved site on the 90% homologous protein is discarded. In the current version, GlycoSiteAlign only takes into consideration a similarity (homology) of 90% to increase the probability of having conserved amino acids in the site positions.

Alignment

The total length of the fragments varies depending on the glycosylation type. If n is the number of selected amino acids (by default 20), in the event of O-linked sites the fragments will have a length of n + 1, since there is no consensus motif. In this case GlycoSiteAlign selects the n amino acids on the right side of the glycosylation site starting from the amino acid found immediately after the glycosylation position. In the case of N-linked sites, the fragment length will be n + 3 because the consensus sequence is composed of 3 amino acids (N-X-S/T). With respect to the previous case, there are two “exceeding” amino acids, and so GlycoSiteAlign selects the fragment on the right side of the glycosylation site starting from the 3rd amino acid after the glycosylation position. In C-glycosylation the consensus motif has 4 amino acids (W-X-X-W), and so the fragments have a total length of n + 4 amino acids. GlycoSiteAlign selects the n-long fragment on the right side of the glycosylation site from the 4th amino acid after the glycosylation position.

In order to constrain the alignment outside the consensus sequences, GlycoSiteAlign uses COBALT, a constraint-based alignment tool. This tool finds a collection of pairwise constraints derived from database searches, sequence similarity and user input, then combines and incorporates them into a progressive multiple alignment. 29 The constraints that GlycoSiteAlign transfers to COBALT are the positions on the glycoprotein fragments of the consensus sequences characterizing each type of glycosylation site: position n + 1 for O-glycosylation, positions n + 1 to n + 3 for N-glycosylation and positions n + 1 to n + 4 for C-glycosylation. In this way, only the portions found on the left and right sides of the glycosylation consensus sequences are actually aligned.

Output

GlycoSiteAlign returns the aligned sequences as both text files and svg images. Each text file contains the UniCarbKB ID/s and the glycan structure/s of the selected glycan/s at the top of the page (one line = one structure), followed by the aligned sequences (one line = one sequence), which are matched with the UniProtKB IDs, the position of the glycosylated amino acids with reference to the whole original sequence and the protein name. Svg images are generated programmatically through Jalview, a system for viewing, editing and analyzing multiple sequence alignments. 30 The CLUSTAL X colour scheme has been chosen in order to highlight any amino acid pattern occurring in the alignments. Each residue in the alignment is assigned a colour if the amino acid profile of the alignment at that position meets some minimum criteria specific to the residue type. 31
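The site transfer onto a homologous protein and the motif-aware fragment selection described in the sections above can be sketched in Python 3 as follows. This is an illustration under stated assumptions, not the GlycoSiteAlign implementation: the names `map_site_to_homolog`, `extract_fragment` and `MOTIF_WIDTH` are hypothetical, a site is treated as conserved only when the aligned residue is identical, and fragments that would run past a sequence end are simply skipped. The sketch reads the description as n residues on each side of a consensus stretch of k residues, which places that stretch at positions n + 1 to n + k of the fragment, in line with the constraints passed to COBALT.

```python
# Width k of the consensus stretch kept at the centre of each fragment:
# the glycosylated residue alone (O-linked), N-X-S/T (N-linked), W-X-X-W (C-linked).
MOTIF_WIDTH = {"O": 1, "N": 3, "C": 4}


def map_site_to_homolog(aligned_ref, aligned_hom, ref_site):
    """Map a 1-based site of the reference onto a homolog via a gapped alignment.

    Both arguments are rows of the same alignment ('-' marks a gap).
    Returns the 1-based position in the homolog when the aligned residue is
    identical to the reference residue (assumed meaning of "conserved"),
    otherwise None.
    """
    ref_pos = hom_pos = 0
    for ref_char, hom_char in zip(aligned_ref, aligned_hom):
        if ref_char != "-":
            ref_pos += 1
        if hom_char != "-":
            hom_pos += 1
        if ref_char != "-" and ref_pos == ref_site:
            return hom_pos if hom_char == ref_char else None
    return None


def extract_fragment(sequence, site, kind, n=20):
    """Cut n residues left of the site, the consensus stretch, and n residues right.

    `site` is the 1-based position of the glycosylated residue. Returns the
    fragment and the 1-based (first, last) positions of the consensus stretch
    inside it, or None when the sequence is too short (an assumption; the
    source does not say how sites near the termini are handled).
    """
    width = MOTIF_WIDTH[kind]
    start = site - 1 - n        # 0-based index of the first residue kept
    end = site - 1 + width + n  # 0-based index one past the last residue kept
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end], (n + 1, n + width)


print(map_site_to_homolog("MK-NGT", "MKANGT", 3))
print(extract_fragment("AAAAANGTCCCCC", 6, "N", n=3))
```

The pair of positions returned by `extract_fragment` is the kind of information that would be handed to a constraint-based aligner so that only the flanking regions are free to move.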


Web Interface

The web interface of GlycoSiteAlign is built using Flask, a BSD licensed micro web application framework written in Python and based on the Werkzeug toolkit and Jinja2 template engine. 32 The home page offers a drop-down menu through which the user can choose the desired glycan features. These are divided into four categories, as shown in Table 1. In addition, a box can be checked to include 90% similar proteins in the alignments. The Go button validates the selection and prompts a new input page. Figure 2A shows an example of the Composition page. The user can choose either a precise number or a range of monosaccharides; if a field is not filled, the number is set to 0, while an asterisk sets the number to any. The more asterisks, the less specific the choice, and the more glycans and glycoproteins will be selected and aligned. As mentioned before, users can enter their own experimentally determined glycan composition/s together with the protein sequence/s and glycosite/s by uploading a fasta file or copying data into the corresponding box. Figure 2B shows the page for Glycan structural properties for the same example. The search by UniCarbKB ID prompts a page where glycan structure IDs of UniCarbKB can be directly input if known. Otherwise, the page provides a link to the UniCarbKB website for querying the database and retrieving the proper IDs. The output page of GlycoSiteAlign displays the results of the alignments, which can be previewed and/or downloaded as text files and/or svg images.
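The rules of the Composition page (an exact number, a range, an empty field meaning 0 and an asterisk meaning any) can be sketched in Python 3 as below. The function names, the `low-high` range syntax and the reduced list of monosaccharides are assumptions made for the illustration; the source does not describe how the form values are encoded.

```python
# Subset of the monosaccharide fields listed in Table 1, kept short for the example.
MONOSACCHARIDES = ("Hexose", "HexNAc", "DeoxyHexose", "NeuAc")


def parse_field(text):
    """Turn one form field into an inclusive (low, high) range; high=None means unbounded."""
    text = text.strip()
    if text == "":
        return (0, 0)          # unfilled field: the number is set to 0
    if text == "*":
        return (0, None)       # asterisk: any number
    if "-" in text:            # hypothetical range syntax "low-high"
        low, high = text.split("-", 1)
        return (int(low), int(high))
    return (int(text), int(text))


def matches(query, composition):
    """Check a glycan composition (name -> count) against the form values (name -> text)."""
    for name in MONOSACCHARIDES:
        low, high = parse_field(query.get(name, ""))
        count = composition.get(name, 0)
        if count < low or (high is not None and count > high):
            return False
    return True


query = {"Hexose": "5", "HexNAc": "2-4", "NeuAc": "*"}
print(matches(query, {"Hexose": 5, "HexNAc": 3, "NeuAc": 2}))
```

Each asterisk removes one constraint from the query, which is why a query with many asterisks selects more glycans and therefore more glycoprotein fragments.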


RESULTS AND DISCUSSION

General overview

Table 2 summarizes the type and the number of glycan features retrieved from UniCarbKB, how many alignments without the inclusion of 90% homologous proteins arose for each feature and the number of glycan structures involved. In total 563 glycan structures, 220 glycosylation sites and 137 glycoproteins were recovered from UniProtKB 2015 release.

Table 2: Number of considered glycan features, number of resulting alignments and of aligned glycan structures without 90% homologous proteins.

GLYCAN FEATURE TYPE    FEATURES             ALIGNMENTS    ALIGNED STRUCTURES

UniCarbKB Id           563                  140           264
Composition            238                  159           488
Mass                   234                  160           485
Structural property    11                   11            563
Core structure         10 N-linked          9             562
                       4 O-linked           4
                       1 N/O-linked         1
                       1 C/O-linked         1
Determinant            1 ABO Blood group    1             547
                       2 Lewis              2
                       9 Antigen            8
                       14 Others            13

N-glycosylation sites

The alignments without similar proteins mostly show the presence of hydrophobic amino acids, glycine (G), cysteine (C) and P residues in specific positions around the glycosylation sites. In the Clustal X colour scheme chosen to graphically represent the alignments, these are highlighted as blue (hydrophobic residues and C residue) vertical and horizontal bands and as yellow and orange spots (for P and G residues respectively). 31 This is illustrated in Figures 3A and 4A, showing the alignments of the “High-mannose” glycan with UniCarbKB Id 1061 and of the “Core-fucosylated” glycan with UniCarbKB Id 1099 respectively. While the first shows mostly bands of single hydrophobic amino acids, the latter also shows amino acid pairs and trios.
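Column-wise regularities of the kind visible as coloured bands can also be summarized numerically. The Python 3 sketch below reports, for each alignment column, the most frequent residue when it reaches a chosen share of the sequences. The function name `column_consensus`, the 0.6 threshold and the toy fragments are assumptions of the example; this is not the criterion used by the Clustal X colour scheme.

```python
from collections import Counter


def column_consensus(aligned, min_fraction=0.6):
    """Return one character per column: the dominant residue, or '.' if none dominates.

    `aligned` is a list of equal-length rows, with '-' marking gaps. A residue is
    reported when it occupies at least `min_fraction` of the rows in that column.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned fragments must have equal length")
    consensus = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            consensus.append(".")
            continue
        residue, count = Counter(residues).most_common(1)[0]
        consensus.append(residue if count / len(column) >= min_fraction else ".")
    return "".join(consensus)


# Toy aligned fragments (placeholders, not taken from the paper's figures).
print(column_consensus(["LANGT", "LCNGS", "L-NAT"]))
```

Dividing by the number of rows rather than by the number of non-gap residues means that gap-rich columns rarely pass the threshold, a deliberate choice for this sketch.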

O-glycosylation sites

Compared with N-linked proteins, the alignments without similar proteins show in most cases fewer blue bands and/or longer patterns, but confirm an important presence of P residues all around the glycosylation sites and in particular at positions -1 and +3. In addition, C and G residues are often associated with proline. For example, the alignments of glycans with UniCarbKB IDs 656 and 657, glucose and fucose respectively, include Epidermal Growth Factor (EGF)-like repeats and TSP-1 repeats that are indeed rich in C and G residues. In particular, in the glycoprotein fragments associated with glycan 656, there are two EGF-like repeat patterns starting from position -2 to position +9 and other positions displaying a high degree of conservation (not shown). In the glycan 657 (fucose) alignment there are two distinct types of amino acid patterns preceding or following the glycosylated amino acid. The first pattern, NGG(T/S)C immediately upstream, is associated mostly with the presence of a leucine (L) or a T at position +5 and is characteristic of EGF-like repeats. 33 The second pattern is a consensus sequence typically found in the TSP-1 domain: CSV (V = valine) preceding the glycosylation site and CGGG following 34 (Figure 5A). Interestingly, the presence of C residues within these motifs and close to glycosylation sites (positions -2 and +3 for glycan 656 and position +1 for glycan 657) seems to contradict previous observations that cysteine in these positions could prevent O-glycosylation. 4 Another interesting alignment corresponds to glycan “Core-2”, which shows an abundance of charged amino acids (E = glutamic acid, R = arginine, K = lysine) at specific positions such as +1, +15, +19 and -14 (Figure 6A).
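Once patterns such as NGG(T/S)C, CSV and CGGG have been spotted in an alignment, the fragments can be scanned for them with regular expressions. The Python 3 sketch below uses the standard `re` module; the function `find_patterns`, the pattern labels and the toy fragments are hypothetical, and offsets follow the convention that position zero is the glycosylated residue, assumed here to sit at 0-based index n of an O-linked fragment.

```python
import re

# Patterns named in the text, written as regular expressions.
PATTERNS = {
    "EGF-like (NGG[TS]C)": re.compile(r"NGG[TS]C"),
    "TSP-1 upstream (CSV)": re.compile(r"CSV"),
    "TSP-1 downstream (CGGG)": re.compile(r"CGGG"),
}


def find_patterns(fragment, n):
    """Return sorted (label, offset) pairs for every pattern hit in the fragment.

    `offset` is the position of the first residue of the hit relative to the
    glycosylated residue, which is assumed to be at 0-based index n.
    """
    hits = []
    for label, regex in PATTERNS.items():
        for match in regex.finditer(fragment):
            hits.append((label, match.start() - n))
    return sorted(hits)


# Toy fragments (placeholders): glycosylated S at index 7, glycosylated T at index 4.
print(find_patterns("AANGGTCSAAAAAAA", 7))
print(find_patterns("ACSVTCGGGA", 4))
```

In the second toy fragment the CGGG hit starts at offset +1, so its cysteine lies directly after the glycosylated residue, the same relative position as the one reported above for glycan 657.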


C-glycosylation sites

For the last major type of glycosylation, four glycosylated sites are found in the human protein TSP-1 and two in two other glycoproteins. As in the previously described alignments, they mostly show that specific positions are preferentially occupied by hydrophobic amino acids (not shown).

Introduction of homologous proteins
N-glycosylation sites
In most cases, adding the 90% homologous proteins to the alignments confirms the most frequent amino acids (usually hydrophobic residues and G and C residues) found in the alignments without similar proteins. In some cases, however, it brings to light longer patterns and/or amino acid residues not previously highlighted. Two examples are shown in Figures 3B and 4B. The first, the alignment of the “High-mannose” glycan with UniCarbKB ID 1061, reveals amino acid patterns of two and four residues preceding the glycosylation sites, such as LLCL, always associated with VF, and GG associated with SVFV, as well as the pattern GGYYVY (Y = tyrosine), which is not present in the alignment shown in Figure 3A. In Figure 4B, red and magenta bands not present in Figure 4A highlight charged amino acids (E, D (aspartic acid), K and R), and blue-cyan strips highlight other patterns such as the HVH trio.

O-glycosylation sites
The addition of 90% homologous proteins to the alignments mostly increases the number of hydrophobic and charged amino acids and of P and G residues around glycosylated sites. This brings out amino acid patterns that are usually found in EGF-like repeats and TSP1 domains. Figure 5B illustrates this situation. On the left side, at position -20, there is a preponderance of the charged amino acid D (magenta band) followed by TSP1 repeats (more rarely) or EGF-like repeats (CSV(S/T)CG and NGG(S/T)C, respectively). The latter are also associated with amino acids other than the D residue. The presence of P and G residues and of hydrophobic amino acids such as valine and alanine has been recognised in many studies as a condition favouring O-glycosylation. 4 With the introduction of similar proteins, the alignment corresponding to “Core-0” shows constrained amino acids around glycosylated sites, in this case charged residues at specific positions (+12, +15 and +16). “Core-2” shows a remarkable conservation of these types of residues (Figure 6B), which were also present in the alignment without similar proteins (Figure 6A). Both confirm hydrophobic amino acids at specific positions.
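As an illustration of how the two motifs quoted above could be located in glycosite fragments, the following minimal sketch translates the notation NGG(S/T)C and CSV(S/T)CG into regular expressions. It is not part of GlycoSiteAlign: the function name `scan_fragment` and the three example fragments are hypothetical and were invented for this example only.

```python
import re

# Motifs quoted in the text, written as regular expressions.
# "(S/T)" in the text's notation becomes the character class [ST].
MOTIFS = {
    "EGF-like": re.compile(r"NGG[ST]C"),
    "TSP1": re.compile(r"CSV[ST]CG"),
}


def scan_fragment(fragment):
    """Return (motif name, 0-based start, matched text) for every hit."""
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(fragment):
            hits.append((name, match.start(), match.group()))
    # Report hits in the order in which they occur along the fragment.
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical fragments, invented for illustration only.
example_fragments = {
    "frag_egf": "DACASNPCQNGGTCVDGINSYSC",
    "frag_tsp": "WSPWSEWSPCSVTCGGGVQTRSR",
    "frag_none": "MKTAYIAKQRQISFVKSHFSRQ",
}

for name, sequence in example_fragments.items():
    print(name, scan_fragment(sequence))
```

A scan of this kind only flags exact matches to the quoted motifs; it says nothing about whether a matching site is actually glycosylated.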

Comparison between broad and fine criteria of glycan selection
Overall, the alignments corresponding to general glycan features (such as glycan structural properties, core structures, large mass range and composition) show fewer amino acid residues and patterns shared between the sequence fragments than those produced with specific features (UniCarbKB ID, determinants, narrow mass range and composition). Two examples of alignment by UniCarbKB ID were shown in Figures 3 and 4. In contrast, Figure 7 compares the alignment of N-linked sites corresponding to the general property “Core-fucosylated” (in A) with the alignments of N-linked sites corresponding to the specific features “Type 1 LN” determinant (in B) and “UniCarbKB ID 2553” (in C). In both B and C, certain amino acids, such as the hydrophobic L, F, A, Y (tyrosine), W and V or the charged D and E (in C), are clearly more abundant. In A, however, G and P are more dominant around glycosylated sites. Figure 8 compares the alignments of O-linked sites corresponding to the general glycan features “Core-1” (in A) and “Sialylated” (in B) with those of the precise glycan features “UniCarbKB ID 650” (in C) and the “Type 2 LN N-LacNAc” determinant (in D). As for the N-linked sites in Figure 7, sequence fragments aligned according to more general features (in A and B) do not show particular amino acid patterns or conservation of specific amino acids in multiple positions, except for a large number of P residues. In contrast, the alignments in C and D show charged (E, D, K and R) and hydrophobic amino acids at specific positions. The glutamic acid (E) residue at +1 in the alignment in D and at -5 in the alignment in B confirms previous observations of a higher frequency of this amino acid at these positions. 4

With the current data collection, we have to point out that, for the majority of the alignments corresponding to specific glycan features, the number of protein fragments is still very limited. This could in part explain why more amino acid patterns and a higher concentration of certain amino acids at specific positions are found in the alignments corresponding to finer glycan features. Nonetheless, we verified, for example, that the addition of homologous proteins confirmed the absence of specific patterns for the more general features and their presence for the more specific features in the alignments shown in Figures 7 and 8. This situation arises in the majority of the alignments. Furthermore, the aligned fragments usually belong to a limited group of organisms (e.g. mammals), which reduces the coverage of the different kingdoms of life and/or of groups within the same kingdom (e.g. plants or reptiles). To extend this coverage and increase the reliability of the alignments, more glycoprotein fragments are necessary. That is why updating the data regarding glycans and glycosylation sites is essential.
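The qualitative comparison between broad and fine alignments could be complemented by a simple number. The sketch below computes the mean Shannon entropy per alignment column, where a lower value means less residue variety. This measure is our illustrative assumption, not something computed by GlycoSiteAlign, and the two toy alignments are invented. Entropy estimated from very few fragments is biased towards low values, which mirrors the caveat above about the limited number of fragments for specific features.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_entropy(alignment):
    """Average column entropy over equal-length aligned fragments."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned fragments must have the same length")
    columns = ["".join(seq[i] for seq in alignment) for i in range(length)]
    return sum(column_entropy(col) for col in columns) / length


# Toy alignments, invented for illustration only.
broad = ["ACDE", "GHIK", "LMNP", "QRST"]  # every column fully variable
fine = ["ACDE", "ACDE", "ACDK", "ACDE"]   # only the last column varies

print("broad:", mean_entropy(broad))
print("fine: ", mean_entropy(fine))
```

Comparing such values is only meaningful between alignments with similar numbers of fragments, or after correcting for sample size.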

Contribution to the improvement of prediction
In the previous section, using Jalview images, we highlighted how the alignments corresponding to specific glycan features correlate with specific amino acid residues around glycosylation sites and reinforce patterns found in alignments corresponding to general features. In the long term, with a larger amount of data, this could contribute to increasing the accuracy and specificity of glycosylation site prediction thanks to a finer determination of the amino acid composition and frequency around glycosylated sites. To underline this aspect, like the authors of GlycoEP, 19 we represented the amino acid frequency of some alignments using frequency weblogos. The amino acid residues are coloured according to their chemistry: polar residues in green, neutral in purple, basic in blue, acidic in red and hydrophobic in black. Figure 9 shows the weblogos of all the N- and O-linked sites (general features) in A and D respectively, of N-sites attached to glycans with a mass in the range 3000-3300 Da in B, of N-sites linked to the α-Gal antigen in C, of O-sites attached to the glycan determinant “Sialyl T antigen” in E and of O-sites occupied by the glycan with UniCarbKB ID 4 in F; panels B, C, E and F are intended as specific features. In the alignments of N- and O-linked sites shown in A and D, there is a great variety in the type of residues at each position, and the difference in their frequency (height of letters) is small. The alignment in B instead shows less variety in the type of residues and, in addition, a greater difference between the frequency of the most common residues and that of the others than in the alignment of N-linked sites shown in A. This is even more visible for the alignment represented in C, where the variety of amino acid residues is even smaller than in B. Compared with the alignments in B and C, those represented in E and F show a larger difference in the frequencies of the different amino acids and less variation in the amino acid residue types. As stated before, this is due to the smaller number of aligned fragments for O-linked sites. Nonetheless, in several positions the most frequent amino acid is the charged glutamic acid (E). In particular, it mostly precedes the glycosylated site at positions -2 to -7 (Figure 9E and F) or is present at positions -14, -17, -19 and -20 (Figure 9D and F). In all cases GlycoSiteAlign helps to distinguish the various occurrences. We cannot rule out that specific patterns only reflect data biases.
Possible patterns are only indicative. Again, the tool was designed to build potential explanations as opposed to producing established results.
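The per-position frequencies that a weblogo displays can be tabulated directly from an alignment, as in the minimal sketch below. The class names and colours follow the text. The assignment of individual residues to classes follows the usual WebLogo chemistry colour scheme and is an assumption here, as are the function names and the toy alignment, in which the glycosylated S/T sits in column 3 (0-based).

```python
from collections import Counter

# Residue classes named in the text (polar = green, neutral = purple,
# basic = blue, acidic = red, hydrophobic = black). Which residue belongs
# to which class is an assumption based on the usual WebLogo scheme.
CHEMISTRY = {
    "polar": "GSTYC",
    "neutral": "NQ",
    "basic": "KRH",
    "acidic": "DE",
    "hydrophobic": "AVLIPWFM",
}
RESIDUE_CLASS = {aa: cls for cls, aas in CHEMISTRY.items() for aa in aas}


def position_frequencies(alignment, site_index):
    """Map position relative to the site -> {residue: frequency}; gaps ignored."""
    freqs = {}
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment if seq[i] != "-")
        total = sum(counts.values())
        if total:
            freqs[i - site_index] = {aa: n / total for aa, n in counts.items()}
    return freqs


def top_residue(freqs, position):
    """Most frequent residue at a relative position, with its chemistry class."""
    aa, freq = max(freqs[position].items(), key=lambda item: item[1])
    return aa, freq, RESIDUE_CLASS.get(aa, "unknown")


# Toy alignment, invented for illustration only.
alignment = ["PEETPA", "AEDSPV", "GE-TLA"]
freqs = position_frequencies(alignment, site_index=3)

for position in sorted(freqs):
    print(position, top_residue(freqs, position))
```

With so few fragments the frequencies are coarse, which is the same limitation noted above for the alignments of O-linked sites.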

Conclusions
The qualitative analysis of the results discussed in the previous section shows how comparing amino acid regions carrying similar glycan structures can single out individual amino acids or patterns of amino acids that are otherwise difficult to highlight in wider alignments. Indeed, the more specific the aligned glycan feature, the more likely it is to find these patterns and/or specific residues around glycosylation sites. Moreover, the introduction of 90% homologous proteins helps to display new motifs and/or confirm patterns displayed in the absence of these proteins. Some of the patterns highlighted by GlycoSiteAlign, for example EGF-like repeats, PTS-rich regions and some hydrophobic amino acids such as valine and alanine, are already well known to contain and/or to favour O-glycosylation. 4,12,35 In contrast, other residues revealed by GlycoSiteAlign close to glycosylation sites, such as cysteine and charged residues, were previously considered to have a negative influence on this type of glycosylation, more precisely cysteine at positions +1 and +3 and glutamic acid at position -6. 4 This observation emphasises the potential of GlycoSiteAlign to recognise amino acid patterns and/or residues that are usually “diluted” or masked in alignments that take into consideration only the glycosylation type (N-, O- or C-linked). The frequency weblogos we used to represent the alignments of all N- and O-sites and of some specific glycan features confirmed this “dilution” effect and accentuated the ability of finer alignments to emphasise patterns. We acknowledge that our results are in part explained by the fact that the more specific the glycan feature, the fewer glycoprotein fragments are aligned. Nonetheless, the introduction of 90% homologous proteins often confirms the absence of patterns for more general features and their presence for more specific features.
This could indicate a real tendency of glycan structures with distinct features to have a specific affinity for particular amino acid sequences. To clarify this point and confirm that structure-based alignments could improve pattern recognition and, in turn, prediction methods, further alignments with both a larger number of glycoprotein fragments and a finer subdivision of the glycan structures are needed. GlycoSiteAlign will therefore be regularly updated with new data coming from glycoproteomic research, a field becoming increasingly productive. The next dataset in preparation, which will soon be integrated, relates to N-glycans of immunoglobulins. More generally, GlycoSiteAlign is tuned to the evolution of UniCarbKB content. However, the option of including user-input experimentally determined glycan structures and glycosylation sites in the alignment produced by GlycoSiteAlign is yet another means of compensating for the current lack of data and potentially accounting for data not stored in UniCarbKB. GlycoSiteAlign is also limited by the bias in experimental data, which are not necessarily representative of the entire pool of known glycosylation sites and glycoproteins. There are, for instance, more data on full structures at N-sites than at O-sites simply because the former are simpler to generate than the latter. Moreover, glycosylation sites extracted from 90% similar proteins are inferred following sequence alignment and only rely on the conservation of a few residues. To improve the usefulness of GlycoSiteAlign as an exploratory tool and possibly a future prediction tool, we are planning to include 2D and 3D structural constraints to define glycosylation sites in the next release. Indeed, as already described in several studies, the primary structure of a glycosylation site seems to be necessary but not sufficient to determine whether the site will be glycosylated. 36,37 With the introduction of 3D structure in GlycoSiteAlign, we aim to produce more reliable alignments and to deepen the knowledge of the relationship between specific glycan structures and glycosylation site features. Such refinement could also be achieved by accounting for information on site heterogeneity, but quantitative data remain limited in present glycoproteomics studies. Finally, GlycoSiteAlign can serve MS data analysis.
For example, the search by glycan composition and mass can be used to quickly investigate which kinds of glycoproteins carry a certain glycan structure with experimental masses. The search by mass can also be useful to assess whether a protein is glycosylated when a predicted mass and an experimentally determined mass disagree. In conclusion, GlycoSiteAlign shows potential as a tool to assess the constraints conditioning the relationship between glycan structures and glycoprotein sites. Using the tool may help in understanding glycoprotein and glycan functions and their physiological and/or pathological roles, and may thereby contribute to the field of functional glycoproteomics.
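The mass-based reasoning mentioned above can be sketched as follows. The example computes glycan masses from a monosaccharide composition and checks whether the gap between a predicted and an observed mass matches an attached glycan. It assumes monoisotopic residue masses, whereas the mass convention used by GlycoSiteAlign is not stated here. The function names, the tolerance, the candidate list and the two mass values are hypothetical.

```python
# Monoisotopic residue masses (Da) of common monosaccharides, i.e. the
# mass each unit contributes once linked (free sugar minus one water).
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}
WATER = 18.0106


def attached_mass(composition):
    """Mass added to a protein or peptide by a glycan of this composition."""
    return sum(RESIDUE_MASS[unit] * n for unit, n in composition.items())


def glycan_mass(composition):
    """Mass of the released (free) glycan: residues plus one water."""
    return attached_mass(composition) + WATER


def explain_shift(predicted, observed, candidates, tol=0.05):
    """Names of candidate glycans whose attached mass matches the mass gap."""
    shift = observed - predicted
    return [
        name
        for name, composition in candidates.items()
        if abs(attached_mass(composition) - shift) <= tol
    ]


# Hypothetical candidates and masses, for illustration only.
candidates = {
    "HexNAc2Hex5": {"HexNAc": 2, "Hex": 5},
    "HexNAc4Hex5dHex1": {"HexNAc": 4, "Hex": 5, "dHex": 1},
}

print(glycan_mass(candidates["HexNAc2Hex5"]))
print(explain_shift(predicted=1000.0, observed=2216.4228, candidates=candidates))
```

A match on mass alone does not identify a structure, since several compositions and many isomers can share the same mass.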

Acknowledgments
We thank our UniCarbKB collaborators, Nicki Packer for fruitful discussions and Matthew Campbell for his help with using the data. This work is supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI). ExPASy is maintained by the web team of the Swiss Institute of Bioinformatics and hosted at the Vital-IT Competency Center.

References
(1) Taylor, C. M.; Karunaratne, C. V.; Xie, N. Glycosides of hydroxyproline: some recent, unusual discoveries. Glycobiology 2012, 22, 757–767.
(2) Trinidad, J. C.; Schoepfer, R.; Burlingame, A. L.; Medzihradszky, K. F. N- and O-glycosylation in the murine synaptosome. Molecular & Cellular Proteomics 2013, 12, 3474–3488.
(3) Wilson, I.; Gavel, Y.; Von Heijne, G. Amino acid distributions around O-linked glycosylation sites. Biochemical Journal 1991, 275, 529–534.
(4) Christlet, T. H. T.; Veluraja, K. Database analysis of O-glycosylation sites in proteins. Biophysical Journal 2001, 80, 952–960.
(5) O’Connell, B.; Tabak, L. A.; Ramasubbu, N. The influence of flanking sequences on O-glycosylation. Biochemical and Biophysical Research Communications 1991, 180, 1024–1030.
(6) Nehrke, K.; Ten Hagen, K. G.; Hagen, F. K.; Tabak, L. A. Charge distribution of flanking amino acids inhibits O-glycosylation of several single-site acceptors in vivo. Glycobiology 1997, 7, 1053–1060.
(7) Thaysen-Andersen, M.; Larsen, M. R.; Packer, N. H.; Palmisano, G. Structural analysis of glycoprotein sialylation–Part I: pre-LC-MS analytical strategies. RSC Advances 2013, 3, 22683–22705.
(8) Doucey, M.-A.; Hess, D.; Blommers, M. J.; Hofsteenge, J. Recombinant human interleukin-12 is the second example of a C-mannosylated protein. Glycobiology 1999, 9, 435–441.
(9) Hofsteenge, J.; Blommers, M.; Hess, D.; Furmanek, A.; Miroshnichenko, O. The Four Terminal Components of the Complement System Are C-Mannosylated on Multiple Tryptophan Residues. Journal of Biological Chemistry 1999, 274, 32786–32794.
(10) Furmanek, A.; Hofsteenge, J. Protein C-mannosylation: facts and questions. Acta Biochimica Polonica 1999, 47, 781–789.
(11) Croset, A.; Delafosse, L.; Gaudry, J.-P.; Arod, C.; Glez, L.; Losberger, C.; Begue, D.; Krstanovic, A.; Robert, F.; Vilbois, F.; Chevalet, L.; Antonsson, B. Differences in the glycosylation of recombinant proteins expressed in HEK and CHO cells. Journal of Biotechnology 2012, 161, 336–348.
(12) Steentoft, C. et al. Precision mapping of the human O-GalNAc glycoproteome through SimpleCell technology. The EMBO Journal 2013, 32, 1478–1488.
(13) Zhang, H. et al. UniPep - a database for human N-linked glycosites: a resource for biomarker discovery. Genome Biology 2006, 7, R73.
(14) Copenhagen Center for Glycomics. NetOGlyc 4.0 Server. 2013; http://www.cbs.dtu.dk/services/NetOGlyc/, [Online; accessed 22-March-2016].
(15) Hansen, J. E.; Lund, O.; Tolstrup, N.; Gooley, A. A.; Williams, K. L.; Brunak, S. NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconjugate Journal 1998, 15, 115–130.
(16) Gupta, R.; Brunak, S. Prediction of glycosylation across the human proteome and the correlation to protein function. Pac. Symp. Biocomput. 2001; pp 310–322.
(17) Copenhagen Center for Glycomics. NetNGlyc 1.0 Server. 2014; http://www.cbs.dtu.dk/services/NetNGlyc/, [Online; accessed 22-March-2016].
(18) Julenius, K.; Mølgaard, A.; Gupta, R.; Brunak, S. Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites. Glycobiology 2005, 15, 153–164.
(19) Chauhan, J. S.; Rao, A.; Raghava, G. P. In silico platform for prediction of N-, O- and C-glycosites in eukaryotic protein sequences. PLoS ONE 2013, 8, e67008.
(20) Li, F.; Li, C.; Wang, M.; Webb, G. I.; Zhang, Y.; Whisstock, J. C.; Song, J. GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics 2015, btu852.
(21) Campbell, M. P.; Peterson, R.; Mariethoz, J.; Gasteiger, E.; Akune, Y.; Aoki-Kinoshita, K. F.; Lisacek, F.; Packer, N. H. UniCarbKB: building a knowledge platform for glycoproteomics. Nucleic Acids Research 2013, gkt1128.
(22) Cooper, C. A.; Joshi, H. J.; Harrison, M. J.; Wilkins, M. R.; Packer, N. H. GlycoSuiteDB: a curated relational database of glycoprotein glycan structures and their biological sources. 2003 update. Nucleic Acids Research 2003, 31, 511–513.
(23) Go, E. P.; Irungu, J.; Zhang, Y.; Dalpathado, D. S.; Liao, H.-X.; Sutherland, L. L.; Alam, S. M.; Haynes, B. F.; Desaire, H. Glycosylation Site-Specific Analysis of HIV Envelope Proteins (JR-FL and CON-S) Reveals Major Differences in Glycosylation Site Occupancy, Glycoform Profiles, and Antigenic Epitopes Accessibility. Journal of Proteome Research 2008, 7, 1660–1674.
(24) Harazono, A.; Kawasaki, N.; Itoh, S.; Hashii, N.; Ishii-Watabe, A.; Kawanishi, T.; Hayakawa, T. Site-specific N-glycosylation analysis of human plasma ceruloplasmin using liquid chromatography with electrospray ionization tandem mass spectrometry. Analytical Biochemistry 2006, 348, 259–268.
(25) Thaysen-Andersen, M.; Packer, N. H. Site-specific glycoproteomics confirms that protein structure dictates formation of N-glycan type, core fucosylation and branching. Glycobiology 2012, 22, 1440–1452.
(26) Li, W.; Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659.
(27) EBI, Clustal omega (REST). 2011; http://www.ebi.ac.uk/Tools/webservices/services/msa/clustalo_rest, [Online; accessed 21-March-2016].
(28) Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T. J.; Karplus, K.; Li, W.; Lopez, R.; McWilliam, H.; Remmert, M.; Söding, J.; Thompson, J. D.; Higgins, D. G. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 2011, 7.
(29) Papadopoulos, J. S.; Agarwala, R. COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 2007, 23, 1073–1079.
(30) Waterhouse, A. M.; Procter, J. B.; Martin, D. M.; Clamp, M.; Barton, G. J. Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009, 25, 1189–1191.
(31) Jalview, Jalview Clustal X Colour Scheme. 2016; http://www.jalview.org/help/html/colourSchemes/clustal.html, [Online; accessed 3-March-2016].
(32) Ronacher, A. Flask, web development, one drop at a time. 2013; http://flask.pocoo.org/, [Online; accessed 3-March-2016].
(33) Davis, C. G. The many faces of epidermal growth factor repeats. The New Biologist 1990, 2, 410–419.
(34) Lawler, J.; Hynes, R. O. The structure of human thrombospondin, an adhesive glycoprotein with multiple calcium-binding sites and homologies with several different proteins. The Journal of Cell Biology 1986, 103, 1635–1648.
(35) Shao, L.; Luo, Y.; Moloney, D. J.; Haltiwanger, R. S. O-glycosylation of EGF repeats: identification and initial characterization of a UDP-glucose: protein O-glucosyltransferase. Glycobiology 2002, 12, 763–770.
(36) Zielinska, D. F.; Gnad, F.; Wiśniewski, J. R.; Mann, M. Precision mapping of an in vivo N-glycoproteome reveals rigid topological and sequence constraints. Cell 2010, 141, 897–907.
(37) Lam, P. V. N.; Goldman, R.; Karagiannis, K.; Narsule, T.; Simonyan, V.; Soika, V.; Mazumder, R. Structure-based comparative analysis and prediction of N-linked glycosylation sites in evolutionarily distant eukaryotes. Genomics, Proteomics & Bioinformatics 2013, 11, 96–104.

Figure 2: Collage of screenshots of two selection pages of GlycoSiteAlign. A: composition page, where users can select a glycan composition from 12 monosaccharides and provide their own glycosites. B: glycan structural properties page, where the user can select 6 different structural features. Both pages have extra boxes to choose the type of protein linkage: N-, O- or C-.

Figure 3: Alignment corresponding to the N-linked “High-mannose” glycan with UniCarbKB ID 1061. A: alignment without similar proteins. B: alignment with similar proteins. Note the appearance of short amino acid patterns such as LLCL (L = leucine), HAV (H = histidine, A = alanine and V = valine) and VFV (F = phenylalanine). The alignment was cropped; for the whole image we refer the reader to the application site.

Figure 4: Alignment corresponding to the N-linked “Core-fucosylated” glycan with UniCarbKB ID 1099. A: alignment without similar proteins. B: alignment with similar proteins. B shows an increase in the short amino acid pattern HVH, already present in A, and the appearance of bands of charged amino acids coloured magenta and red. The alignment was cropped; for the whole image we refer the reader to the application site.

Figure 5: Alignment corresponding to the O-linked “Non-core-fucosylated” glycan with UniCarbKB ID 657. A: alignment without similar proteins. It shows the EGF-like repeat pattern NGG(T/S)C and the TSP1 repeat patterns CSV preceding and CGGG following the glycosylated amino acid. B: alignment with similar proteins. It shows how the addition of 90% homologous proteins strengthens the patterns in A and reveals other patterns (e.g. the magenta band on the left). The alignment was cropped; for the whole image we refer the reader to the application site.

Figure 6: Alignment corresponding to “Core-2” O-linked glycans. A: alignment without similar proteins. It shows the presence of the charged amino acids E (magenta band), R and K (red bands) preceding and following the glycosylated amino acid, which is found at position 21. B: alignment that includes similar proteins. It shows how the addition of 90% homologous proteins confirms and singles out charged amino acids at specific positions (magenta and red bands) and displays other patterns composed of hydrophobic amino acids (blue band) and H (histidine). The alignment was cropped; for the whole image we refer the reader to the application site.

Figure 7: Comparison of N-linked site alignments corresponding to general and specific glycan features. A: alignment of sites attached to “Core-fucosylated” glycans (general feature). It shows the absence of amino acid patterns and the presence of just a blue band on the left side. The alignment was cropped; for the whole image we refer the reader to the application site. B: alignment corresponding to the “Type 1 LN” glycan determinant. C: alignment of the “Core-fucosylated” glycan with UniCarbKB ID 2553. These two specific alignments show a stronger presence of certain amino acids at specific positions (blue bands and blue and magenta spots) than the alignment in A.

Figure 8: Comparison of O-linked site alignments corresponding to general and specific glycan features. A: alignment of sites attached to “Core-1” glycans (general feature). It shows the absence of amino acid patterns and the presence of just a blue band on the right side. B: alignment of sites attached to “Sialylated” glycans (general feature). As in A, there is no particular pattern except for a strong presence of proline and a blue band on the right side of the glycosylated site. The alignments in A and B were cropped; for the whole images we refer the reader to the application site. C: alignment corresponding to the glycan with UniCarbKB ID 650. It shows charged and hydrophobic amino acids at specific positions. D: alignment of the protein fragments attached to the “Type2 LN NLacNAc” glycan. As in C, there is a stronger presence of certain amino acids at specific positions (blue, magenta and red bands) than in the alignments in A and B.

Figure 9: Comparison of frequency weblogos between general and specific glycan features. A and D: alignments of N- and O-linked sites respectively (general features). They show the high variation in the type of amino acid residues around glycosylation sites. B: alignment of N-sites attached to glycans with a mass between 3000 and 3300 Da. C: alignment corresponding to the glycan determinant “α-Gal antigen” (N-linked). E: alignment corresponding to the glycan determinant “Sialyl T antigen” (O-linked). F: alignment of sites attached to glycans with UniCarbKB ID 4 (GalGalNAc, O-linked). B, C, E and F are all specific features. Compared with A, in B and C there is less variation in the type of amino acids at each position along the sequence fragments, especially in the alignment shown in C. Moreover, the difference in the height of letters (amino acid frequency) is more pronounced. The same can be observed in E and F compared with D.
