Mining for Small Translated ORFs - Journal of Proteome Research

Nov 30, 2017 - Peptides encoded by short open reading frames (sORFs) are usually defined as peptides ≤100 aa long. Usually sORFs were ignored by aut...

Review

Mining for small translated ORFs Anastasia Chugunova, Tsimafei Navalayeu, Olga Dontsova, and Petr Sergiev J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00707 • Publication Date (Web): 30 Nov 2017 Downloaded from http://pubs.acs.org on December 1, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Mining for small translated ORFs

Anastasia Chugunova 1,2, Tsimafei Navalayeu 1, Olga Dontsova 1,2, Petr Sergiev 1,2*

1 Lomonosov Moscow State University, Department of Chemistry and A.N. Belozersky Institute of Physico-Chemical Biology, Moscow, 119992, Russia

2 Skolkovo Institute of Science and Technology, Skolkovo, Moscow region, 143025, Russia

* Corresponding author, +7 495 9395418, [email protected]

Abstract

Peptides encoded by short open reading frames (sORFs) are usually defined as peptides no longer than 100 aa. sORFs were usually ignored by automatic genome annotation programs because of the high probability of false discovery. However, improved computational tools, along with the high-throughput RIBO-seq approach, have identified a myriad of translated sORFs. Their importance is becoming evident as experimental validation of their diverse cellular functions accumulates. This review examines computational and experimental approaches to sORF identification and summarizes our current knowledge of their functional roles in cells.

Keywords: peptide, small ORF, uORF, lncRNA, ribosome profiling, RIBO-seq, translation, coding potential, genome annotation, small peptide

Abbreviations: open reading frame (ORF); short open reading frame (sORF); untranslated region (UTR); long non-coding RNA (lncRNA); circular RNA (circRNA); micro RNA (miRNA); ribosomal RNA (rRNA); ribosome protected fragments (RPF); 5’-UTR sORF (uORF).

Introduction

The adage “everything new is just well-forgotten old” is particularly true regarding a novel class of bioactive peptides encoded by short open reading frames (sORFs). Even though studies in this field began decades ago, it is now becoming increasingly obvious that the diversity of these biologically active molecules was severely underestimated. The main reason for the lag between protein and peptide gene identification was the small size of peptides, which precluded automatic annotation of their ORFs. However, over the past few decades an increasing variety of peptides smaller than 150 amino acids have been identified in various organisms, from bacteria to humans1-10. The functional diversity of these peptides, as well as improvements in the toolbox used for their identification, is attracting increasing attention from the scientific community11-13. All these studies indicated that peptides translated from sORFs act as eminent regulators in many vital processes, such as metabolism4,10, endocytosis14, immune surveillance15,16, development17,18, and cell death9. In this review, we will summarize our current understanding of sORF-encoded peptide functions as well as consider modern approaches to sORF identification.


Methods of sORF detection

Currently, identification of translated sORFs is based on three broad approaches: sequence analysis by computational methods, ribosome profiling and mass-spectrometry.

sORF prediction by sequence analysis

Bioinformatic analysis of a genome sequence is the first step in predicting sORF existence, but achieving high sensitivity and specificity of prediction remains challenging. Several questions arise, the most important being where, if at all, to set a peptide length threshold. sORFs can occur in any given sequence just by chance. Thus, countless putative open reading frames may be found in any genome, while the overwhelming majority of them are never translated and fulfill no function. Such “non-existing” peptides are usually very small due to the high chance of encountering a stop codon in any random nucleotide sequence. The general ORF length cutoff for computational methods of protein coding sequence prediction was 100 amino acids19, since only a small fraction of known functional proteins are shorter than 100 amino acids20. However convenient this guideline is, it leaves out small peptides whose genes are expressed and which do have functional roles in vivo. Since a functional sORF could be as small as two codons21, the utility of a length threshold for gene annotation is questionable.

Usage of alternative start codons is another major problem that makes the discovery of new sORFs even more complicated. It was previously noticed that some proteins start with a non-AUG initiation codon22. Thus, searching for novel functional peptides is almost like “fishing in the dark”. Despite these issues, many computational approaches were established to distinguish expressed functional sORFs from purely hypothetical ones, which constitute the predominant majority23 (Table 1). A recent review gives a comprehensive digest of most such bioinformatic programs24.

How can a “needle” of functional sORFs be identified in a “haystack” of hypothetical ones? Today, many prediction tools employ a combination of diverse features (Figure 1) to discriminate coding sequences from non-coding ones, but most often they rely on the same general principles25. These include 1) the identification of a conserved sORF by sequence comparison between different species (Figure 1A); 2) analysis of codon usage and other characteristic features of coding regions within sORF sequences (Figure 1B); 3) assessment of sequence similarity to previously identified proteins or functional domains (Figure 1C-D).

Conservation is a substantial factor that assists in finding functional sequences. Proteins and peptides are usually products of evolution, and thus their sequences should be more conserved than the bulk of non-coding DNA, which is not under the pressure of selection. Different metrics might be applied to identify a protein coding gene (Figure 1A), starting from the simple conservation of nucleotide sequence26. Additional support for the functionality of a given potential sORF may be provided by the dN/dS metric, reflecting the prevalence of synonymous versus non-synonymous substitutions in a particular reading frame27,28. To estimate (potential) coding sequence conservation, a more precise metric might be used, taking into account the likelihoods of particular types of codon substitutions at a given phylogenetic distance29. The similarity to known protein domains (Figure 1D) and proteins (Figure 1C) may also help, as protein sequences are frequently composed of domains and motifs acquired from other proteins.

However, both methods have their shortcomings. Finding orthologs might be counterproductive for newly emerging and yet functional entities30. Also, whole genome sequences and alignments are not available for all related species. Verification of sORF coding potential based on sequence similarities with known proteins is limited by ORF length, since this method is size-dependent and should span a number of conserved positions31. In general, for shorter patches of conserved sequences more species should be used for comparative genome analysis to obtain reliable statistical power32.

Another approach to predict the functionality of a putative sORF is to analyze its nucleotide and codon composition (Figure 1B)33,34. Just as a fragment of meaningful written text differs from a random set of letters, functional protein coding sequences differ from non-coding ones in their compositional statistics. Parameters as simple as nucleotide frequencies are different for coding and non-coding DNA, e.g. human coding sequences are generally more GC-rich than noncoding ones35. The likelihood of a sequence being coding may be assessed more precisely using periodicity in nucleotide frequencies, since the first, the second and the third nucleotides of codons have different average composition34. Finally, codon and amino acid preference metrics, codon pair preferences and hidden Markov models could be used (Figure 1B)34,36.
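As a minimal illustration of the compositional features discussed above, the following Python sketch computes the overall GC content and the nucleotide frequencies at each of the three codon positions of a candidate sORF. The function names and the toy sequence are assumptions made for this example; they are not taken from any of the tools listed in Table 1, which rely on considerably richer models.

```python
from collections import Counter
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Return the fraction of G and C nucleotides in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def positional_composition(orf: str) -> List[Dict[str, float]]:
    """Return nucleotide frequencies at codon positions 1, 2 and 3.

    The sequence is assumed to be in frame (its first nucleotide is the
    first position of a codon); trailing nucleotides that do not form a
    complete codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    profile = []
    for pos in range(3):
        bases = orf[pos:usable:3]          # every third nucleotide
        counts = Counter(bases)
        total = len(bases)
        profile.append(
            {b: (counts.get(b, 0) / total if total else 0.0) for b in "ACGT"}
        )
    return profile


def gc3(orf: str) -> float:
    """Return the GC fraction at the third codon position."""
    third = positional_composition(orf)[2]
    return third["G"] + third["C"]


# Toy example (hypothetical sequence, not a real sORF)
candidate = "ATGGCCGCGTAA"
overall_gc = gc_content(candidate)
per_position = positional_composition(candidate)
third_position_gc = gc3(candidate)
```

A difference between the three per-position profiles is the simplest form of the periodicity signal; a real classifier would compare such statistics against distributions learned from known coding and non-coding sequences.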

Comparative genomics approaches might be combined with ab initio coding region identification using nucleotide periodicities, codon and codon pair frequencies to increase the precision of prediction37. Experimental methods are needed to support translated sORF predictions made purely by computational sequence analysis. For non-conserved or newly emerging functional sORFs, experimental methods are indispensable for making decisive conclusions.
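The starting point of all sequence-based approaches, the enumeration of candidate ORFs, can be sketched in Python as follows. This is an illustrative sketch only: the function names, the forward-strand-only scan, the default thresholds (2 to 100 codons) and the assumption of independent, equally frequent nucleotides in `prob_no_stop` are ours and do not reproduce any specific published program. The `starts` parameter allows non-AUG initiation codons to be included.

```python
from typing import List, Sequence, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(
    seq: str,
    min_codons: int = 2,
    max_codons: int = 100,
    starts: Sequence[str] = ("ATG",),
) -> List[Tuple[int, int, int]]:
    """Enumerate ORFs on the forward strand.

    Returns (start, end, frame) tuples, where `end` is the index just
    past the stop codon. ORF length is counted in codons, start codon
    included and stop codon excluded. For each stop codon only the most
    upstream in-frame start is reported.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] in starts:
                j = i
                # Walk codon by codon until an in-frame stop is found
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 > len(seq):
                    break  # no in-frame stop downstream of this start
                n_codons = (j - i) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append((i, j + 3, frame))
                i = j + 3
            else:
                i += 3
    return orfs


def prob_no_stop(n_codons: int) -> float:
    """Probability that n random codons contain no stop codon.

    Assumes independent, equally frequent nucleotides, so that 61 of
    the 64 codons are sense codons.
    """
    return (61 / 64) ** n_codons


candidates = find_orfs("CCATGAAATAGGG")
chance_of_100_codon_orf = prob_no_stop(100)
```

Under the stated uniform-composition assumption, `prob_no_stop` makes the length-threshold argument concrete: open frames of a few codons arise by chance almost everywhere, whereas frames of 100 codons are rare, which is why a cutoff suppresses false positives at the cost of discarding genuine sORFs.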

Ribosome profiling

In 2009, a genome-wide experimental approach that enables direct detection of sORFs was invented38. Ribosome profiling is, essentially, next generation sequencing of ribosome footprints, or ribosome protected fragments of mRNA, which allows mapping the location of all translating ribosomes in a cell (Figure 2A). Subsequent analysis of the cell translatome revealed that many RNAs considered to lack coding potential are, in fact, actively translated, including long noncoding RNAs39-41 and 5’- and 3’-UTRs of known genes. At first glance, ribosome profiling data suggested that translation, like transcription, has a pervasive nature42. The first results of ribosome profiling, which ascribed a translation capacity to many non-coding RNAs41, resemble a pendulum that has swung too far in the direction opposite to the total neglect of such a possibility. Later examination corroborated untranslated status for many non-coding RNAs43. Only a 0.4% fraction of putative lincRNA translation products could be confirmed experimentally with the help of mass-spectrometry23. What is detected by ribosome profiling might be not only translation itself, but also association with ribosomes, or in some cases perhaps even co-purification of unrelated ribonucleoproteins with ribosomal fractions. E.g., scanning 40S ribosomal subunits could contribute to such footprints44.

Both experimental improvements in the ribosome profiling technique and advances in computational analysis of the data contributed to an increase in the precision of translated sORF detection by ribosome profiling. Among the technical advancements are the use of harringtonine or lactimidomycin to stall ribosomes at the start codon45,46 in order to classify translation start sites, and affinity isolation of translating ribosomes to avoid co-purification of unrelated ribonucleoproteins with the ribosomes, which may happen if ultracentrifugation is used42,47,48. Additionally, the Poly-Ribo-Seq method was suggested, where only polysomes that represent active translation are isolated and used for footprinting49. The translation complex profile sequencing (TCP-seq) method made use of formaldehyde cross-linking to isolate small ribosomal subunit associated mRNA fragments in addition to conventional 80S-protected footprints, which provided information on both initiation and elongation of translation50.

Computational methods of ribosome profiling data analysis (Figure 2B-E) aiming to detect sORF translation (Table 2) rely on several specific features of ribosome footprints, used in different combinations. The most widely used property of ribosome footprints is three nucleotide periodicity (Figure 2C,D), resulting from the codonwise ribosome movement during translation30,51-55. Careful usage of this periodicity allows detection of frame shifting events and overlapping ORF translation56. A ribosome protects on average 31 nt of mRNA. The distribution of ribosome footprint lengths across the putative sORF (Figure 2E) might help to distinguish translated regions from scanned regions, from regions occupied by post-termination sliding ribosomes, and from RNA fragments that are products of RNase digestion of large ribonucleoprotein complexes co-sedimenting with ribosomes by chance42. Finally, ribosome footprints have a specific distribution along the transcript. The most straightforward expectation is a sharp decrease in ribosome occupancy after an in-frame stop codon (Figure 2B), which is utilized in a number of methods for detection of translated sequences39,43. A decrease in ribosome occupancy following an in-frame stop codon may work less efficiently for prediction of overlapping translated ORFs encoded in the same transcript, as termination of translation in one frame would not decrease occupancy by ribosomes translating an overlapping ORF in another frame (Figure 2B, C, D, E). Many studies take into consideration the footprint coverage ratio between the ORF and untranslated regions39,52. At a more detailed level, differential ribosome density of the 5’-UTR, 3’-UTR, coding region and regions of transition between coding and non-coding areas may be taken into account51.

Several helpful resources, such as sORFs.org57 and GWIPS-viz58, have been created for exploring sORF existence. sORFs.org (http://www.sorfs.org) is a user-friendly database of sORFs identified by ribosome profiling in three different cell lines, HCT116 (human), E14_mESC (mouse) and S2 (fruit fly), intended for researchers with limited bioinformatics knowledge. The GWIPS-viz (Genome Wide Information on Protein Synthesis visualized) browser (http://gwips.ucc.ie) allows researchers to analyze alternative proteoforms using the genomic alignments of ribosome profiling data and corresponding mRNA-seq data along with relevant annotation tracks. Recently, useful Web-based resources, such as Ribogalaxy59, were developed to integrate the entire ribosome profiling analysis pipeline.

Despite all the advantages of ribosome profiling, its accuracy ultimately depends on the accuracy of the experimental dataset60. There are still examples where sORFs characterized as protein coding according to one protocol26 appear to lack ribosome-protected fragments according to another18,39. A no less important factor for accuracy is the subsequent data analysis, which can also lead to different conclusions39,43. Therefore, no matter how powerful the ribosome profiling method is, it is useful to complement it with another independent technique.
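The three-nucleotide periodicity and the drop in occupancy after the stop codon described in this section can be illustrated with a short Python sketch. It assumes that each footprint has already been reduced to a single P-site coordinate on the transcript (a preprocessing step not shown here); the function names and the 30 nt downstream window are hypothetical choices for the example, not parameters of the methods in Table 2.

```python
from typing import List, Sequence, Tuple


def frame_fractions(
    psites: Sequence[int], orf_start: int, orf_end: int
) -> List[float]:
    """Fraction of footprints in each reading frame of a putative ORF.

    `psites` holds one transcript coordinate per footprint (assumed to
    be the P-site position). Frame 0 is the frame of the ORF start
    codon; `orf_end` is the first coordinate after the stop codon.
    """
    counts = [0, 0, 0]
    for pos in psites:
        if orf_start <= pos < orf_end:
            counts[(pos - orf_start) % 3] += 1
    total = sum(counts)
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [c / total for c in counts]


def occupancy_drop(
    psites: Sequence[int], orf_start: int, orf_end: int, window: int = 30
) -> Tuple[float, float]:
    """Footprint density (per nt) inside the ORF and just downstream.

    A translated ORF is expected to show a much higher density inside
    the ORF than in the `window` nucleotides after the stop codon.
    """
    inside = sum(1 for p in psites if orf_start <= p < orf_end)
    downstream = sum(1 for p in psites if orf_end <= p < orf_end + window)
    return inside / (orf_end - orf_start), downstream / window


# Toy data (hypothetical positions, not from a real experiment)
footprints = [100, 103, 106, 109, 101, 112, 135, 200]
fractions = frame_fractions(footprints, orf_start=100, orf_end=130)
density_in, density_after = occupancy_drop(footprints, 100, 130)
```

A strong excess of footprints in frame 0 together with a density drop after the stop codon supports translation of the candidate ORF. As noted above, the second criterion is weaker when an overlapping ORF in another frame is translated, because its ribosomes keep the downstream density high.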

Mass-spectrometry


Mass spectrometry is a direct method of peptide detection. Predominantly, information about the proteome composition of a cell is obtained by ‘shotgun proteomics’, which applies liquid chromatography (LC) followed by tandem mass spectrometry (MS/MS) for identification of either natural peptides or hydrolyzed fragments of total or fractionated proteins61 (Figure 3A). Identification of peptides from acquired MS/MS spectra is achieved by matching them against theoretical spectra of all candidate peptides represented in a reference protein sequence database, most commonly Ensembl, RefSeq or UniProtKB62 (Figure 3B, C). A drawback of this strategy is that many peptides are not present in a particular reference database due to polymorphic sites in peptide coding genes, alternative splice forms or lack of annotation.

The generation of a customized database may solve this problem. There are several strategies to achieve this, and the most obvious one is six-frame translation of the entire genome. Unfortunately, such a dataset is difficult to use due to its extremely large size and the large number of non-existing protein sequences it contains63,64. Another way is to create a smaller database by translation of EST (Expressed Sequence Tag) data65, but it is still substantially large. A further reduction can be achieved by three-frame translation of annotated RNA transcript data, which already carries experimental confirmation of transcription23,66. Out-of-frame peptides and alternative translation initiation sites can also be identified using this database.

Similar to the case of ribosome profiling, sample preparation for mass spectrometry can dramatically impact the results. Recently, it has been shown that the choice of workflow influences short peptide discovery67. A particular advancement is the Electrostatic Repulsion Hydrophilic Interaction Chromatography (ERLIC) method, which allows identification of 90-94 sORF encoded peptides per run, compared to only 13-19 without ERLIC. Recently, a pipeline for the analysis of both ribosome profiling and mass-spectrometry data for the same sample was reported68. Integration of such different types of data in a single program package may present a way for further development of sORF identification techniques.
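The three-frame translation strategy for building a customized search database, mentioned earlier in this section, can be sketched as follows. This is a simplified Python illustration under stated assumptions: only AUG-initiated, stop-terminated peptides are kept, the standard genetic code is used, and the function names, length limits and FASTA header format are hypothetical rather than those of any published pipeline.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq: str) -> str:
    """Translate a nucleotide sequence; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_peptides(
    transcript: str, min_len: int = 2, max_len: int = 100
) -> List[Tuple[int, str]]:
    """Return (frame, peptide) pairs for the three forward frames.

    A peptide runs from the first methionine of a stop-delimited
    segment to the stop codon; segments without a terminating stop
    codon are discarded.
    """
    transcript = transcript.upper().replace("U", "T")
    entries = []
    for frame in range(3):
        protein = translate(transcript[frame:])
        # The last element after splitting has no terminating stop codon
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start == -1:
                continue
            peptide = segment[start:]
            if min_len <= len(peptide) <= max_len:
                entries.append((frame, peptide))
    return entries


def to_fasta(transcript_id: str, entries: List[Tuple[int, str]]) -> str:
    """Format peptides as FASTA records (header layout is hypothetical)."""
    records = [
        f">{transcript_id}|frame{frame}|pep{n}\n{peptide}"
        for n, (frame, peptide) in enumerate(entries, start=1)
    ]
    return "\n".join(records)


peptides = three_frame_peptides("CCAUGAAAUAGGG")
fasta_text = to_fasta("toy_transcript", peptides)
```

Restricting translation to annotated transcripts and to the three forward frames keeps the search space far smaller than a six-frame genome translation, which is the motivation given in the text. Supporting non-AUG initiation would require replacing the methionine search with a codon-level scan.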

Where to find sORFs

Functional roles of sORFs are enormously diverse. They can be subdivided into two main, sometimes overlapping, categories. Small peptides encoded within sORFs may function as independent biochemical entities18,69-73. Alternatively, sORF translation itself, rather than its peptide product, might be essential for regulating the expression of a nearby protein coding ORF74-76. sORF species may also be classified by their location relative to other genes (Figure 4). In the following paragraphs, we describe each class separately and present an up-to-date overview of their functions.

5’-UTRs

Short ORFs located in the 5’ untranslated region of mRNAs are called uORFs. Around 40% of mammalian 5’-UTRs contain uORFs, exemplifying their genome-wide prevalence77. Some of them are conserved across different species78,79, and many uORF-altering mutations lead to genetic disorders and diseases80,81, highlighting their importance. In many known cases80,82, but not exclusively83, uORFs are involved in translation control of the downstream CDS. uORFs can reduce the amount of protein encoded downstream through modulation of translation efficacy84, or by triggering mRNA decay85-87. However, under stress conditions they could positively regulate protein levels88,89. A subset of uORFs influences ribosome function in response to small molecules, causing ribosome stalling at the stop codon75,76,90-92.

Intriguingly, 5’-UTR-encoded peptides could also have functions aside from translation regulation. For instance, in bacteria the 5’-part of SgrS RNA (sugar transport-related sRNA) codes for a short peptide (SgrT) that inhibits glucose uptake via direct binding to the glucose transporter1. Another example is the highly conserved human uMKKS1 and uMKKS2 open reading frames, located in the 5’ region of MKKS (from McKusick-Kaufman syndrome) mRNA. uMKKS1 and uMKKS2 are translated from a short transcript and subsequently integrate into the mitochondrial membrane in cultured human cells93, but their function remains obscure. One more example is the recent discovery of a uORF in the 5’ leader sequence of angiotensin type 1a receptor (AT1aR) mRNA, whose peptide product interferes with the non G-protein coupled signaling pathway activated by angiotensin II. It was proposed that this peptide binds to AT1aR at an allosteric site, which differs from the angiotensin II binding site, and that the complex is then internalized via receptor mediated endocytosis14. Overall, translation of 5’-UTRs appears to be not only a regulatory mechanism for downstream gene expression control, but also a source of peptides that have independent functions in the cell.

Overlapping ORFs

It was shown that a single mRNA can be translated in different frames to yield completely dissimilar amino acid sequences94,95. In 1996, it became clear that the melanoma antigen gp75 codes for two completely different polypeptides, gp75 and a 24 aa peptide; the latter serves as a tumor rejection antigen recognized by T cells16. Since then it was discovered that several human tumor rejection antigens are produced from alternative open reading frames, and that their lengths vary96-98.

Interestingly, this phenomenon is common not only for cancer cells. Several pairs of reference ORF/alternative ORF were discovered, such as: INK4a/ARF99, histone H4/OGP100, XLalphas/ALEX101,102, PrP/altPrP103, ATXN1/altATXN1104. Bioinformatic analysis shows that around 41% of human mRNAs contain at least one alternative ORF within the reference, and the majority of the predicted ORFs represent proteins smaller than 90 aa in length105. Additionally, several such proteins produced from BDH2, NIPA1, SCARB2, LGALS3BP, VEGFC, and p53, were detected by mass-spectrometry. Their diverse subcellular localizations speak in favor of a variety of possible functions associated with such alternative protein variants105.

3’-UTRs

Whereas multiple examples of sORFs in 5’-UTR sequences are discussed in the scientific literature, those in 3’-UTRs seem to attract significantly less attention, since it is widely believed that 3’-UTRs cannot be translated. In reality, little is known about sORFs in the 3’ untranslated region106. Sometimes ribosomes are found to be associated with the mRNA 3’-UTR, but this is considered to be a result of stop-codon readthrough107,108 or delayed ribosome drop-off after the termination of translation109. Several studies argue in favor of 3’-UTR translation. Utilizing ribosome profiling and mass-spectrometry, it was uncovered that some 3’-UTRs could be translated, and that their peptide products do exist in cells30,67,105,106,110,111 and fulfill important functions. For example, the 3’-UTR of the H60 histocompatibility gene is translated and MHC I molecules present this peptide to the immune system15,112. It elicits the activation of cytotoxic T cells and induces self-tolerance, establishing that immune surveillance extends well beyond conventional polypeptides. Recent studies have revealed that a sORF in the 3’-UTR of the MRVI1 gene produces a peptide which co-localizes and interacts with BRCA1, although the role of this interaction is unknown105. Many 3’-UTR sORFs are predicted to exist, and the aforementioned examples clearly indicate that 3’-UTR-encoded peptides deserve much more attention than they have received so far.

Long non-coding RNAs

Transcripts longer than 200 nucleotides that lack ORFs longer than a hundred amino acids are usually considered to be long non-coding RNAs, and they represent a huge class of RNAs in the cell. Owing to their length, they could hide many bioactive peptide genes not yet annotated, although the number of such peptides confirmed so far constitutes only a small fraction of those predicted23. Luckily, this field has not been without revelations. Many lncRNAs were shown to produce peptides17,72,113-115 with diverse functions in a cell. The very first examples were two tiny peptides, only 12 and 24 aa long, encoded by the lncRNA ENOD40 in soybeans. They interact with a specific subunit of sucrose synthase (nodulin 100), and their involvement in the control of sucrose use in nitrogen-fixing nodules was suggested. Another instance is a peptide encoded by the tarsal-less (tal) lncRNA, which controls epidermal differentiation in Drosophila17. Notably, this peptide (Pri) triggers amino-terminal truncation of the Shavenbaby (Svb) transcription factor, thereby activating it. Pri peptide controls the recognition of Svb by the Ubr3 ubiquitin ligase and activates its processing by the proteasome70.

In Drosophila, there are three more examples of sORF encoded bioactive peptides which have been conserved for more than 550 million years in a range of species from flies to humans, and are related to the vertebrate peptides Sarcolipin and Phospholamban in sequence and predicted structure72. They are less than thirty amino acids (30 aa) long and participate in the regulation of calcium transport, consequently affecting regular muscle contraction in the Drosophila heart. Next, the 90 amino acid-long SPAR polypeptide encoded by LINC00961 is also conserved across species114. SPAR appears to control the activity of mTORC, a critical sensor of nutrient availability within the cell that regulates a variety of cellular processes, including translation, metabolism, cell growth, and proliferation. One more example is pgc RNA, which was considered noncoding but actually encodes a 71 aa polypeptide that interacts with positive transcription elongation factor b (P-TEFb) and thereby prevents its recruitment to transcription sites71. These examples strongly demonstrate that lncRNAs, previously considered as noncoding, are a well-hidden source of functional peptide genes.

Circular RNAs

Recently, it was shown that thousands of circular RNAs (circRNAs) are produced in eukaryotic cells116-119.

This new class of regulatory RNAs shapes gene expression by titrating microRNAs, regulating transcription, and interfering with splicing120. However, their function is not limited to the abovementioned, since it was discovered that circRNAs are also capable of coding for proteins121. It was noted that a set of circRNAs is associated with ribosomes and is translated. In response to specific signals, circMbl1 RNA121 may be translated to produce a peptide which targets synapses, illustrating that communication between neurons might involve so far uncharacterized mechanisms. Moreover, circMbl1 translation is induced by starvation and certain pathways involved in aging, hinting at a possible link between circRNA coding potential and aging processes.

Pri-micro RNAs

Since miRNAs are transcribed as large primary transcripts with subsequent maturation to active miRNA, their precursors have the potential to code for functional peptides. This idea was successfully confirmed, and studies show that pri-miR171b, pri-miR165a, and pri-TAS transcripts from various plants contain short open reading frames122,123, which somehow participate in the stabilization of pre-miRNA, thus influencing the level of active miRNA. Five ORFs were found in other pri-miRNAs, suggesting that pri-miRNA encoded peptides are common in plants122. However, the authors raise a reasonable question: are miRNA peptides present in humans? Future studies can shed light on their existence in animals.

Ribosomal RNAs

Several studies uncovered an exciting fact: ribosomal RNAs, particularly mitochondrial ones, may code for functional peptides. A polypeptide highly conserved across species, named humanin, was discovered to be encoded in mitochondrial 16S rRNA9. By interacting with the Bcl-2-associated X protein (Bax), humanin prevents Bax activation and cell death124,125. Since then it was reported that humanin is involved in a variety of biological processes such as apoptosis, inflammatory response, cell survival, substrate metabolism, and response to oxidative stress, ischemia, and starvation126,127. Later, it was shown that mitochondrial 12S rRNA also has coding potential. It was reported to be translated into the MOTS-c peptide (mitochondrial open reading frame of the 12S rRNA-c)10, which regulates insulin sensitivity and energy homeostasis, affecting the folate cycle and de novo purine biosynthesis.

Conclusions

Repeatedly, nature shows us that what was impossible yesterday becomes reality now. In this review, we have attempted to reflect the emerging and complex field of short ORFs. Short peptides encoded by sORFs play various eminent roles in cells, ranging from metabolism and translation regulation to aging and cell death. Nowadays, scientists have an arsenal of powerful methods for sORF investigation. Many bioinformatic tools were developed and subsequently refined in order to meet growing needs for accurate and reliable sORF prediction. These can be complemented by ribosome profiling and mass-spectrometry approaches. A further need is envisioned for integration of evolutionary conservation, ribosome profiling and mass spectrometry data in a single computational platform. sORFs are found in different locations relative to protein coding genes or in what seemed to be non-coding transcripts, and only now are we starting to understand the functions of these small peptides and their distribution in different organisms. We are at the beginning of the long path to classification and systematization of all short peptides. Some classes that were previously annotated as noncoding (lncRNAs and rRNA, for example) or as parts of other mRNAs (5’- and 3’-UTRs) were shown to encode functionally important peptides. Are there any other classes of non-coding RNA molecules that could potentially code for peptides? What are the functions of such peptides? These questions remain to be answered.


FUNDING

This work was supported by the Russian Science Foundation (grant 14-14-00072) and the Moscow University Development Program (grant PNR 5.13).

CONFLICT OF INTEREST STATEMENT

No conflicts of interest declared.

ACKNOWLEDGMENTS

We thank all members of the P.V.S. group for discussions and inspiration and Alex Lebedeff for improvement of the manuscript.


References

(1) Wadler, C. S.; Vanderpool, C. K., A dual function for a bacterial small RNA: SgrS performs base pairing-dependent regulation and encodes a functional polypeptide. Proc Natl Acad Sci U S A 2007, 104, 20454-20459. (2) Casson, S. A.; Chilley, P. M.; Topping, J. F.; Evans, I. M.; Souter, M. A.; Lindsey, K., The POLARIS gene of Arabidopsis encodes a predicted peptide required for correct root growth and leaf vascular patterning. Plant Cell 2002, 14, 1705-1721. (3) Rohrig, H.; Schmidt, J.; Miklashevichs, E.; Schell, J.; John, M., Soybean ENOD40 encodes two peptides that bind to sucrose synthase. Proc Natl Acad Sci U S A 2002, 99, 1915-1920. (4) Dong, X.; Wang, D.; Liu, P.; Li, C.; Zhao, Q.; Zhu, D.; Yu, J., Zm908p11, encoded by a short open reading frame (sORF) gene, functions in pollen tube growth as a profilin ligand in maize. J Exp Bot 2013, 64, 2359-2372. (5) Kastenmayer, J. P.; Ni, L.; Chu, A.; Kitchen, L. E.; Au, W. C.; Yang, H.; Carter, C. D.; Wheeler, D.; Davis, R. W.; Boeke, J. D.; Snyder, M. A.; Basrai, M. A., Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Res 2006, 16, 365-373. (6) Gleason, C. A.; Liu, Q. L.; Williamson, V. M., Silencing a candidate nematode effector gene corresponding to the tomato resistance gene Mi-1 leads to acquisition of virulence. Mol Plant Microbe Interact 2008, 21, 576-585. (7) Kondo, T.; Hashimoto, Y.; Kato, K.; Inagaki, S.; Hayashi, S.; Kageyama, Y., Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA. Nat Cell Biol 2007, 9, 660-665. (8) Galindo, M. I.; Pueyo, J. I.; Fouix, S.; Bishop, S. A.; Couso, J. P., Peptides encoded by short ORFs control development and define a new eukaryotic gene family. PLoS Biol 2007, 5, e106. 
(9) Hashimoto, Y.; Niikura, T.; Tajima, H.; Yasukawa, T.; Sudo, H.; Ito, Y.; Kita, Y.; Kawasumi, M.; Kouyama, K.; Doyu, M.; Sobue, G.; Koide, T.; Tsuji, S.; Lang, J.; Kurokawa, K.; Nishimoto, I., A rescue factor abolishing neuronal cell death by a wide spectrum of familial Alzheimer's disease genes and Abeta. Proc Natl Acad Sci U S A 2001, 98, 6336-6341. (10) Lee, C.; Zeng, J.; Drew, B. G.; Sallam, T.; Martin-Montalvo, A.; Wan, J.; Kim, S. J.; Mehta, H.; Hevener, A. L.; de Cabo, R.; Cohen, P., The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metab 2015, 21, 443-454. (11) Makarewich, C. A.; Olson, E. N., Mining for Micropeptides. Trends Cell Biol 2017, 27, 685-696. (12) Plaza, S.; Menschaert, G.; Payre, F., In Search of Lost Small Peptides. Annu Rev Cell Dev Biol 2017, 33, 391-416. (13) Couso, J. P.; Patraquim, P., Classification and function of small open reading frames. Nat Rev Mol Cell Biol 2017, 18, 575-589. (14) Yosten, G. L.; Liu, J.; Ji, H.; Sandberg, K.; Speth, R.; Samson, W. K., A 5'-upstream short open reading frame encoded peptide regulates angiotensin type 1a receptor production and signalling via the beta-arrestin pathway. J Physiol 2016, 594, 1601-1605. (15) Schwab, S. R.; Li, K. C.; Kang, C.; Shastri, N., Constitutive display of cryptic translation products by MHC class I molecules. Science 2003, 301, 1367-1371. (16) Wang, R. F.; Parkhurst, M. R.; Kawakami, Y.; Robbins, P. F.; Rosenberg, S. A., Utilization of an alternative open reading frame of a normal gene in generating a novel human cancer antigen. J Exp Med 1996, 183, 1131-1140. (17) Kondo, T.; Plaza, S.; Zanet, J.; Benrabah, E.; Valenti, P.; Hashimoto, Y.; Kobayashi, S.; Payre, F.; Kageyama, Y., Small peptides switch the transcriptional activity of Shavenbaby during Drosophila embryogenesis. Science 2010, 329, 336-339. (18) Pauli, A.; Norris, M. L.; Valen, E.; Chew, G. L.; Gagnon, J. 
A.; Zimmerman, S.; Mitchell, A.; Ma, J.; Dubrulle, J.; Reyon, D.; Tsai, S. Q.; Joung, J. K.; Saghatelian, A.; Schier, A. F., Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science 2014, 343, 1248636. (19) Basrai, M. A.; Hieter, P.; Boeke, J. D., Small open reading frames: beautiful needles in the haystack. Genome Res 1997, 7, 768-771. (20) Frith, M. C.; Forrest, A. R.; Nourbakhsh, E.; Pang, K. C.; Kai, C.; Kawai, J.; Carninci, P.; Hayashizaki, Y.; Bailey, T. L.; Grimmond, S. M., The abundance of short proteins in the mammalian proteome. PLoS Genet 2006, 2, e52. (21) Tanaka, M.; Sotta, N.; Yamazumi, Y.; Yamashita, Y.; Miwa, K.; Murota, K.; Chiba, Y.; Hirai, M. Y.; Akiyama, T.; Onouchi, H.; Naito, S.; Fujiwara, T., The Minimum Open Reading Frame, AUG-Stop, Induces Boron-Dependent Ribosome Stalling and mRNA Degradation. Plant Cell 2016, 28, 2830-2849. (22) Ivanov, I. P.; Firth, A. E.; Michel, A. M.; Atkins, J. F.; Baranov, P. V., Identification of evolutionarily conserved non-AUG-initiated N-terminal extensions in human coding sequences. Nucleic Acids Res 2011, 39, 4220-4234. (23) Slavoff, S. A.; Mitchell, A. J.; Schwaid, A. G.; Cabili, M. N.; Ma, J.; Levin, J. Z.; Karger, A. D.; Budnik, B. A.; Rinn, J. L.; Saghatelian, A., Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat Chem Biol 2013, 9, 59-64. (24) Andrews, S. J.; Rothnagel, J. A., Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 2014, 15, 193-204.


(25) Housman, G.; Ulitsky, I., Methods for distinguishing between protein-coding and long noncoding RNAs and the elusive biological purpose of translation of long noncoding RNAs. Biochim Biophys Acta 2016, 1859, 31-40. (26) Siepel, A.; Bejerano, G.; Pedersen, J. S.; Hinrichs, A. S.; Hou, M.; Rosenbloom, K.; Clawson, H.; Spieth, J.; Hillier, L. W.; Richards, S.; Weinstock, G. M.; Wilson, R. K.; Gibbs, R. A.; Kent, W. J.; Miller, W.; Haussler, D., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15, 1034-50. (27) Yang, Z., Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol Biol Evol 1998, 15, 568-573. (28) Nei, M.; Gojobori, T., Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986, 3, 418-426. (29) Lin, M. F.; Jungreis, I.; Kellis, M., PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 2011, 27, i275-282. (30) Bazzini, A. A.; Johnstone, T. G.; Christiano, R.; Mackowiak, S. D.; Obermayer, B.; Fleming, E. S.; Vejnar, C. E.; Lee, M. T.; Rajewsky, N.; Walther, T. C.; Giraldez, A. J., Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J 2014, 33, 981-993. (31) Couso, J. P., Finding smORFs: getting closer. Genome Biol 2015, 16, 189. (32) Eddy, S. R., A model of the statistical power of comparative genome sequence analysis. PLoS Biol 2005, 3, e10. (33) Gelfand, M. S., Prediction of function in DNA sequence analysis. J Comput Biol 1995, 2, 87-115. (34) Alioto, T.; Guigó, R., State of the art in eukaryotic gene prediction. In Modern Genome Annotation: the BioSapiens Network, Frishman, D.; Valencia, A., Eds. Springer: Vienna, 2008; pp 7–40. (35) Louie, E.; Ott, J.; Majewski, J., Nucleotide frequency variation across human genes. Genome Res 2003, 13, 2594-2601. (36) Koonin, E. V.; Galperin, M. 
Y., In Sequence - Evolution - Function: Computational Approaches in Comparative Genomics, Boston, 2003. (37) Badger, J. H.; Olsen, G. J., CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 1999, 16, 512-524. (38) Ingolia, N. T.; Ghaemmaghami, S.; Newman, J. R.; Weissman, J. S., Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 2009, 324, 218-223. (39) Chew, G. L.; Pauli, A.; Rinn, J. L.; Regev, A.; Schier, A. F.; Valen, E., Ribosome profiling reveals resemblance between long non-coding RNAs and 5' leaders of coding RNAs. Development 2013, 140, 2828-2834. (40) Smith, J. E.; Alvarez-Dominguez, J. R.; Kline, N.; Huynh, N. J.; Geisler, S.; Hu, W.; Coller, J.; Baker, K. E., Translation of small open reading frames within unannotated RNA transcripts in Saccharomyces cerevisiae. Cell Rep 2014, 7, 1858-1866. (41) Ingolia, N. T.; Lareau, L. F.; Weissman, J. S., Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 2011, 147, 789-802. (42) Ingolia, N. T.; Brar, G. A.; Stern-Ginossar, N.; Harris, M. S.; Talhouarne, G. J.; Jackson, S. E.; Wills, M. R.; Weissman, J. S., Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep 2014, 8, 1365-1379. (43) Guttman, M.; Russell, P.; Ingolia, N. T.; Weissman, J. S.; Lander, E. S., Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 2013, 154, 240-251. (44) Wilson, B. A.; Masel, J., Putatively noncoding transcripts show extensive association with ribosomes. Genome Biol Evol 2011, 3, 1245-1252. (45) Lee, S.; Liu, B.; Lee, S.; Huang, S. X.; Shen, B.; Qian, S. B., Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc Natl Acad Sci U S A 2012, 109, E2424-2432. (46) Gao, X.; Wan, J.; Liu, B.; Ma, M.; Shen, B.; Qian, S. 
B., Quantitative profiling of initiating ribosomes in vivo. Nat Methods 2015, 12, 147-153. (47) Heiman, M.; Schaefer, A.; Gong, S.; Peterson, J. D.; Day, M.; Ramsey, K. E.; Suarez-Farinas, M.; Schwarz, C.; Stephan, D. A.; Surmeier, D. J.; Greengard, P.; Heintz, N., A translational profiling approach for the molecular characterization of CNS cell types. Cell 2008, 135, 738-748. (48) Sanz, E.; Yang, L.; Su, T.; Morris, D. R.; McKnight, G. S.; Amieux, P. S., Cell-type-specific isolation of ribosome-associated mRNA from complex tissues. Proc Natl Acad Sci U S A 2009, 106, 13939-13944. (49) Aspden, J. L.; Eyre-Walker, Y. C.; Phillips, R. J.; Amin, U.; Mumtaz, M. A.; Brocard, M.; Couso, J. P., Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. Elife 2014, 3, e03528. (50) Archer, S. K.; Shirokikh, N. E.; Beilharz, T. H.; Preiss, T., Dynamics of ribosome scanning and recycling revealed by translation complex profiling. Nature 2016, 535, 570-574. (51) Raj, A.; Wang, S. H.; Shim, H.; Harpak, A.; Li, Y. I.; Engelmann, B.; Stephens, M.; Gilad, Y.; Pritchard, J. K., Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling. Elife 2016, 5. (52) Ji, Z.; Song, R.; Regev, A.; Struhl, K., Many lncRNAs, 5'UTRs, and pseudogenes are translated and some are likely to express functional proteins. Elife 2015, 4, e08890. (53) Calviello, L.; Mukherjee, N.; Wyler, E.; Zauber, H.; Hirsekorn, A.; Selbach, M.; Landthaler, M.; Obermayer, B.; Ohler, U., Detecting actively translated open reading frames in ribosome profiling data. Nat Methods 2016, 13, 165-170.


(54) Chun, S. Y.; Rodriguez, C. M.; Todd, P. K.; Mills, R. E., SPECtre: a spectral coherence--based classifier of actively translated transcripts from ribosome profiling sequence data. BMC Bioinformatics 2016, 17, 482. (55) Dunn, J. G.; Weissman, J. S., Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data. BMC Genomics 2016, 17, 958. (56) Michel, A. M.; Choudhury, K. R.; Firth, A. E.; Ingolia, N. T.; Atkins, J. F.; Baranov, P. V., Observation of dually decoded regions of the human genome using ribosome profiling data. Genome Res 2012, 22, 2219-2229. (57) Olexiouk, V.; Crappe, J.; Verbruggen, S.; Verhegen, K.; Martens, L.; Menschaert, G., sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res 2016, 44, D324-329. (58) Michel, A. M.; Ahern, A. M.; Donohue, C. A.; Baranov, P. V., GWIPS-viz as a tool for exploring ribosome profiling evidence supporting the synthesis of alternative proteoforms. Proteomics 2015, 15, 2410-2416. (59) Michel, A. M.; Mullan, J. P.; Velayudhan, V.; O'Connor, P. B.; Donohue, C. A.; Baranov, P. V., RiboGalaxy: A browser based platform for the alignment, analysis and visualization of ribosome profiling data. RNA Biol 2016, 13, 316-319. (60) O'Connor, P. B.; Andreev, D. E.; Baranov, P. V., Comparative survey of the relative impact of mRNA features on local ribosome profiling read density. Nat Commun 2016, 7, 12915. (61) Bantscheff, M.; Lemeer, S.; Savitski, M. M.; Kuster, B., Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem 2012, 404, 939-965. (62) Nesvizhskii, A. I., Proteogenomics: concepts, applications and computational strategies. Nat Methods 2014, 11, 1114-1125. (63) Fermin, D.; Allen, B. B.; Blackwell, T. W.; Menon, R.; Adamski, M.; Xu, Y.; Ulintz, P.; Omenn, G. S.; States, D. J., Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol 2006, 7, R35. 
(64) Khatun, J.; Yu, Y.; Wrobel, J. A.; Risk, B. A.; Gunawardena, H. P.; Secrest, A.; Spitzer, W. J.; Xie, L.; Wang, L.; Chen, X.; Giddings, M. C., Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics 2013, 14, 141. (65) Edwards, N. J., Novel peptide identification from tandem mass spectra using ESTs and sequence database compression. Mol Syst Biol 2007, 3, 102. (66) Wang, X.; Slebos, R. J.; Wang, D.; Halvey, P. J.; Tabb, D. L.; Liebler, D. C.; Zhang, B., Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res 2012, 11, 1009-1017. (67) Ma, J.; Ward, C. C.; Jungreis, I.; Slavoff, S. A.; Schwaid, A. G.; Neveu, J.; Budnik, B. A.; Kellis, M.; Saghatelian, A., Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res 2014, 13, 1757-1765. (68) Crappe, J.; Ndah, E.; Koch, A.; Steyaert, S.; Gawron, D.; De Keulenaer, S.; De Meester, E.; De Meyer, T.; Van Criekinge, W.; Van Damme, P.; Menschaert, G., PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res 2015, 43, e29. (69) Slavoff, S. A.; Heo, J.; Budnik, B. A.; Hanakahi, L. A.; Saghatelian, A., A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining. J Biol Chem 2014, 289, 10950-10957. (70) Zanet, J.; Benrabah, E.; Li, T.; Pelissier-Monier, A.; Chanut-Delalande, H.; Ronsin, B.; Bellen, H. J.; Payre, F.; Plaza, S., Pri sORF peptides induce selective proteasome-mediated protein processing. Science 2015, 349, 1356-1358. (71) Hanyu-Nakamura, K.; Sonobe-Nojima, H.; Tanigawa, A.; Lasko, P.; Nakamura, A., Drosophila Pgc protein inhibits P-TEFb recruitment to chromatin in primordial germ cells. Nature 2008, 451, 730-733. (72) Magny, E. G.; Pueyo, J. I.; Pearl, F. M.; Cespedes, M. A.; Niven, J. E.; Bishop, S. A.; Couso, J. 
P., Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames. Science 2013, 341, 1116-1120. (73) Pueyo, J. I.; Magny, E. G.; Sampson, C. J.; Amin, U.; Evans, I. R.; Bishop, S. A.; Couso, J. P., Hemotin, a Regulator of Phagocytosis Encoded by a Small ORF and Conserved across Metazoans. PLoS Biol 2016, 14, e1002395. (74) Ivanov, I. P.; Loughran, G.; Atkins, J. F., uORFs with unusual translational start codons autoregulate expression of eukaryotic ornithine decarboxylase homologs. Proc Natl Acad Sci U S A 2008, 105, 10079-10084. (75) Wiese, A.; Elzinga, N.; Wobbes, B.; Smeekens, S., A conserved upstream open reading frame mediates sucrose-induced repression of translation. Plant Cell 2004, 16, 1717-1729. (76) Hanfrey, C.; Elliott, K. A.; Franceschetti, M.; Mayer, M. J.; Illingworth, C.; Michael, A. J., A dual upstream open reading frame-based autoregulatory circuit controlling polyamine-responsive translation. J Biol Chem 2005, 280, 39229-39237. (77) Young, S. K.; Wek, R. C., Upstream Open Reading Frames Differentially Regulate Gene-specific Translation in the Integrated Stress Response. J Biol Chem 2016, 291, 16927-16935. (78) Neafsey, D. E.; Galagan, J. E., Dual modes of natural selection on upstream open reading frames. Mol Biol Evol 2007, 24, 1744-1751. (79) Crowe, M. L.; Wang, X. Q.; Rothnagel, J. A., Evidence for conservation and selection of upstream open reading frames suggests probable encoding of bioactive peptides. BMC Genomics 2006, 7, 16.


(80) Calvo, S. E.; Pagliarini, D. J.; Mootha, V. K., Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc Natl Acad Sci U S A 2009, 106, 7507-7512. (81) Barbosa, C.; Peixeiro, I.; Romao, L., Gene expression regulation by upstream open reading frames and human disease. PLoS Genet 2013, 9, e1003529. (82) Iacono, M.; Mignone, F.; Pesole, G., uAUG and uORFs in human and rodent 5'untranslated mRNAs. Gene 2005, 349, 97-105. (83) Rogers, G. W., Jr.; Edelman, G. M.; Mauro, V. P., Differential utilization of upstream AUGs in the beta-secretase mRNA suggests that a shunting mechanism regulates translation. Proc Natl Acad Sci U S A 2004, 101, 2794-2799. (84) Morris, D. R.; Geballe, A. P., Upstream open reading frames as regulators of mRNA translation. Mol Cell Biol 2000, 20, 8635-8642. (85) Mendell, J. T.; Sharifi, N. A.; Meyers, J. L.; Martinez-Murillo, F.; Dietz, H. C., Nonsense surveillance regulates expression of diverse classes of mammalian transcripts and mutes genomic noise. Nat Genet 2004, 36, 1073-1078. (86) Yepiskoposyan, H.; Aeschimann, F.; Nilsson, D.; Okoniewski, M.; Muhlemann, O., Autoregulation of the nonsense-mediated mRNA decay pathway in human cells. RNA 2011, 17, 2108-2118. (87) Ruiz-Echevarria, M. J.; Peltz, S. W., The RNA binding protein Pub1 modulates the stability of transcripts containing upstream open reading frames. Cell 2000, 101, 741-751. (88) Spriggs, K. A.; Bushell, M.; Willis, A. E., Translational regulation of gene expression during conditions of cell stress. Mol Cell 2010, 40, 228-237. (89) Andreev, D. E.; O'Connor, P. B.; Fahey, C.; Kenny, E. M.; Terenin, I. M.; Dmitriev, S. E.; Cormican, P.; Morris, D. W.; Shatsky, I. N.; Baranov, P. V., Translation of 5' leaders is pervasive in genes resistant to eIF2 repression. Elife 2015, 4, e03971. (90) Fang, P.; Wang, Z.; Sachs, M. 
S., Evolutionarily conserved features of the arginine attenuator peptide provide the necessary requirements for its function in translational regulation. J Biol Chem 2000, 275, 26710-26719. (91) Law, G. L.; Raney, A.; Heusner, C.; Morris, D. R., Polyamine regulation of ribosome pausing at the upstream open reading frame of S-adenosylmethionine decarboxylase. J Biol Chem 2001, 276, 38036-38043. (92) Raney, A.; Law, G. L.; Mize, G. J.; Morris, D. R., Regulated translation termination at the upstream open reading frame in s-adenosylmethionine decarboxylase mRNA. J Biol Chem 2002, 277, 5988-5994. (93) Akimoto, C.; Sakashita, E.; Kasashima, K.; Kuroiwa, K.; Tominaga, K.; Hamamoto, T.; Endo, H., Translational repression of the McKusick-Kaufman syndrome transcript by unique upstream open reading frames encoding mitochondrial proteins with alternative polyadenylation sites. Biochim Biophys Acta 2013, 1830, 2728-2738. (94) Kochetov, A. V., Alternative translation start sites and hidden coding potential of eukaryotic mRNAs. Bioessays 2008, 30, 683-691. (95) Mouilleron, H.; Delcourt, V.; Roucou, X., Death of a dogma: eukaryotic mRNAs can code for more than one protein. Nucleic Acids Res 2016, 44, 14-23. (96) Rosenberg, S. A.; Tong-On, P.; Li, Y.; Riley, J. P.; El-Gamil, M.; Parkhurst, M. R.; Robbins, P. F., Identification of BING-4 cancer antigen translated from an alternative open reading frame of a gene in the extended MHC class II region using lymphocytes from a patient with a durable complete regression following immunotherapy. J Immunol 2002, 168, 2402-2407. (97) Ronsin, C.; Chung-Scott, V.; Poullion, I.; Aknouche, N.; Gaudin, C.; Triebel, F., A non-AUG-defined alternative open reading frame of the intestinal carboxyl esterase mRNA generates an epitope recognized by renal cell carcinoma-reactive tumor-infiltrating lymphocytes in situ. J Immunol 1999, 163, 483-490. (98) Huang, J.; El-Gamil, M.; Dudley, M. E.; Li, Y. F.; Rosenberg, S. A.; Robbins, P. 
F., T cells associated with tumor regression recognize frameshifted products of the CDKN2A tumor suppressor gene locus and a mutated HLA class I gene product. J Immunol 2004, 172, 6057-6064. (99) Quelle, D. E.; Zindy, F.; Ashmun, R. A.; Sherr, C. J., Alternative reading frames of the INK4a tumor suppressor gene encode two unrelated proteins capable of inducing cell cycle arrest. Cell 1995, 83, 993-1000. (100) Bab, I.; Smith, E.; Gavish, H.; Attar-Namdar, M.; Chorev, M.; Chen, Y. C.; Muhlrad, A.; Birnbaum, M. J.; Stein, G.; Frenkel, B., Biosynthesis of osteogenic growth peptide via alternative translational initiation at AUG85 of histone H4 mRNA. J Biol Chem 1999, 274, 14474-14481. (101) Klemke, M.; Kehlenbach, R. H.; Huttner, W. B., Two overlapping reading frames in a single exon encode interacting proteins--a novel way of gene usage. EMBO J 2001, 20, 3849-3860. (102) Abramowitz, J.; Grenet, D.; Birnbaumer, M.; Torres, H. N.; Birnbaumer, L., XLalphas, the extra-long form of the alpha-subunit of the Gs G protein, is significantly longer than suspected, and so is its companion Alex. Proc Natl Acad Sci U S A 2004, 101, 8366-8371. (103) Vanderperre, B.; Staskevicius, A. B.; Tremblay, G.; McCoy, M.; O'Neill, M. A.; Cashman, N. R.; Roucou, X., An overlapping reading frame in the PRNP gene encodes a novel polypeptide distinct from the prion protein. FASEB J 2011, 25, 2373-2386. (104) Bergeron, D.; Lapointe, C.; Bissonnette, C.; Tremblay, G.; Motard, J.; Roucou, X., An out-of-frame overlapping reading frame in the ataxin-1 coding sequence encodes a novel ataxin-1 interacting protein. J Biol Chem 2013, 288, 21824-21835.


(105) Vanderperre, B.; Lucier, J. F.; Bissonnette, C.; Motard, J.; Tremblay, G.; Vanderperre, S.; Wisztorski, M.; Salzet, M.; Boisvert, F. M.; Roucou, X., Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS One 2013, 8, e70698. (106) Mackowiak, S. D.; Zauber, H.; Bielow, C.; Thiel, D.; Kutz, K.; Calviello, L.; Mastrobuoni, G.; Rajewsky, N.; Kempa, S.; Selbach, M.; Obermayer, B., Extensive identification and analysis of conserved small ORFs in animals. Genome Biol 2015, 16, 179. (107) Dunn, J. G.; Foo, C. K.; Belletier, N. G.; Gavis, E. R.; Weissman, J. S., Ribosome profiling reveals pervasive and regulated stop codon readthrough in Drosophila melanogaster. Elife 2013, 2, e01179. (108) Arribere, J. A.; Cenik, E. S.; Jain, N.; Hess, G. T.; Lee, C. H.; Bassik, M. C.; Fire, A. Z., Translation readthrough mitigation. Nature 2016, 534, 719-723. (109) Miettinen, T. P.; Bjorklund, M., Modified ribosome profiling reveals high abundance of ribosome protected mRNA fragments derived from 3' untranslated regions. Nucleic Acids Res 2015, 43, 1019-1034. (110) Gascoigne, D. K.; Cheetham, S. W.; Cattenoz, P. B.; Clark, M. B.; Amaral, P. P.; Taft, R. J.; Wilhelm, D.; Dinger, M. E.; Mattick, J. S., Pinstripe: a suite of programs for integrating transcriptomic and proteomic datasets identifies novel proteins and improves differentiation of protein-coding and non-coding genes. Bioinformatics 2012, 28, 3042-3050. (111) Prabakaran, S.; Hemberg, M.; Chauhan, R.; Winter, D.; Tweedie-Cullen, R. Y.; Dittrich, C.; Hong, E.; Gunawardena, J.; Steen, H.; Kreiman, G.; Steen, J. A., Quantitative profiling of peptides from RNAs classified as noncoding. Nat Commun 2014, 5, 5429. (112) Malarkannan, S.; Shih, P. P.; Eden, P. A.; Horng, T.; Zuberi, A. R.; Christianson, G.; Roopenian, D.; Shastri, N., The molecular and functional characterization of a dominant minor H antigen, H60. J Immunol 1998, 161, 3501-3509. (113) Anderson, D. 
M.; Anderson, K. M.; Chang, C. L.; Makarewich, C. A.; Nelson, B. R.; McAnally, J. R.; Kasaragod, P.; Shelton, J. M.; Liou, J.; Bassel-Duby, R.; Olson, E. N., A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell 2015, 160, 595-606. (114) Matsumoto, A.; Pasut, A.; Matsumoto, M.; Yamashita, R.; Fung, J.; Monteleone, E.; Saghatelian, A.; Nakayama, K. I.; Clohessy, J. G.; Pandolfi, P. P., mTORC1 and muscle regeneration are regulated by the LINC00961-encoded SPAR polypeptide. Nature 2017, 541, 228-232. (115) Nelson, B. R.; Makarewich, C. A.; Anderson, D. M.; Winders, B. R.; Troupes, C. D.; Wu, F.; Reese, A. L.; McAnally, J. R.; Chen, X.; Kavalali, E. T.; Cannon, S. C.; Houser, S. R.; Bassel-Duby, R.; Olson, E. N., A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science 2016, 351, 271-275. (116) Salzman, J.; Gawad, C.; Wang, P. L.; Lacayo, N.; Brown, P. O., Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One 2012, 7, e30733. (117) Jeck, W. R.; Sorrentino, J. A.; Wang, K.; Slevin, M. K.; Burd, C. E.; Liu, J.; Marzluff, W. F.; Sharpless, N. E., Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA 2013, 19, 141-157. (118) Memczak, S.; Jens, M.; Elefsinioti, A.; Torti, F.; Krueger, J.; Rybak, A.; Maier, L.; Mackowiak, S. D.; Gregersen, L. H.; Munschauer, M.; Loewer, A.; Ziebold, U.; Landthaler, M.; Kocks, C.; le Noble, F.; Rajewsky, N., Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 2013, 495, 333-8. (119) Salzman, J.; Chen, R. E.; Olsen, M. N.; Wang, P. L.; Brown, P. O., Cell-type specific features of circular RNA expression. PLoS Genet 2013, 9, e1003777. (120) Chen, L. L., The biogenesis and emerging roles of circular RNAs. Nat Rev Mol Cell Biol 2016, 17, 205-211. (121) Pamudurti, N. 
R.; Bartok, O.; Jens, M.; Ashwal-Fluss, R.; Stottmeister, C.; Ruhe, L.; Hanan, M.; Wyler, E.; Perez-Hernandez, D.; Ramberger, E.; Shenzis, S.; Samson, M.; Dittmar, G.; Landthaler, M.; Chekulaeva, M.; Rajewsky, N.; Kadener, S., Translation of CircRNAs. Mol Cell 2017, 66, 9-21 e7. (122) Lauressergues, D.; Couzigou, J. M.; Clemente, H. S.; Martinez, Y.; Dunand, C.; Becard, G.; Combier, J. P., Primary transcripts of microRNAs encode regulatory peptides. Nature 2015, 520, 90-93. (123) Yoshikawa, M.; Iki, T.; Numa, H.; Miyashita, K.; Meshi, T.; Ishikawa, M., A Short Open Reading Frame Encompassing the MicroRNA173 Target Site Plays a Role in trans-Acting Small Interfering RNA Biogenesis. Plant Physiol 2016, 171, 359-368. (124) Guo, B.; Zhai, D.; Cabezas, E.; Welsh, K.; Nouraini, S.; Satterthwait, A. C.; Reed, J. C., Humanin peptide suppresses apoptosis by interfering with Bax activation. Nature 2003, 423, 456-461. (125) Zhai, D.; Luciano, F.; Zhu, X.; Guo, B.; Satterthwait, A. C.; Reed, J. C., Humanin binds and nullifies Bid activity by blocking its activation of Bax and Bak. J Biol Chem 2005, 280, 15815-15824. (126) Lee, C.; Yen, K.; Cohen, P., Humanin: a harbinger of mitochondrial-derived peptides? Trends Endocrinol Metab 2013, 24, 222-228. (127) Gong, Z.; Tas, E.; Muzumdar, R., Humanin and age-related diseases: a new link? Front Endocrinol (Lausanne) 2014, 5, 210. (128) Kong, L.; Zhang, Y.; Ye, Z. Q.; Liu, X. Q.; Zhao, S. Q.; Wei, L.; Gao, G., CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res 2007, 35, W345-9. (129) Hanada, K.; Akiyama, K.; Sakurai, T.; Toyoda, T.; Shinozaki, K.; Shiu, S. H., sORF finder: a program package to identify small open reading frames with high coding potential. Bioinformatics 2010, 26, 399-400.


(130) Skarshewski, A.; Stanton-Cook, M.; Huber, T.; Al Mansoori, S.; Smith, R.; Beatson, S. A.; Rothnagel, J. A., uPEPperoni: an online tool for upstream open reading frame location and analysis of transcript conservation. BMC Bioinformatics 2014, 15, 36. (131) Fields, A. P.; Rodriguez, E. H.; Jovanovic, M.; Stern-Ginossar, N.; Haas, B. J.; Mertins, P.; Raychowdhury, R.; Hacohen, N.; Carr, S. A.; Ingolia, N. T.; Regev, A.; Weissman, J. S., A Regression-Based Analysis of Ribosome-Profiling Data Reveals a Conserved Complexity to Mammalian Translation. Mol Cell 2015, 60, 816-827.


Figure legends

Figure 1. Criteria and methods used to detect the coding potential of putative short ORFs by computational sequence analysis. (A) An example of a conservation metric plotted by genomic coordinate in three coding frames. Conservation metrics might include the simple prevalence of synonymous over nonsynonymous codon substitutions (dS/dN) or a score that takes into account the likelihood of finding particular substitutions in coding regions at particular cross-species evolutionary distances relative to that for non-coding regions. A set of equivalent aligned genome regions from several species is needed. (B) An example of the likelihood that a sliding sequence window codes for a protein, plotted for three frames. Here, the metric might be either a codon and dicodon (hexamer) usage preference or a 3-nucleotide periodicity in nucleotide composition. (C) An example of a protein sequence alignment of a putative small protein with similar sequence patches found in other proteins, not necessarily located in similar genomic contexts. Colored circles indicate amino acid identities; lines indicate deletions. (D) Identification of a region within a potentially functional sORF that matches a conserved protein domain, schematically depicted as a red box indicating an arbitrary domain of unknown function.

Figure 2. (A) Ribosome profiling workflow. mRNAs in complex with ribosomes are extracted from the cell. After nuclease treatment, only ribosome-bound regions of RNA remain. Protein removal followed by deep sequencing reveals ribosome positions on a particular RNA. (B-E) Different principles for bioinformatic analysis of RIBO-seq data. (B) Ribosome release score. The blue line indicates the mRNA and the red cylinder the ORF. Bars illustrate the density of ribosomes at each position on the transcript. For a protein-coding mRNA, ribosome density drops sharply after the stop codon, in contrast to a non-coding RNA. (C) RiboTaper. Three possible ORFs are shown as +1 (red), +2 (light blue) and +3 (mint green). In the examples, only the +1 frame codes for a functional peptide. AUG and UGA could be replaced by any other start or stop codon, respectively. The P-site of every ribosome footprint is mapped to the annotated ORFs. A translated ORF is enriched in P-sites. (D) ORF score. The coding ORF contains most of the mapped footprints. (E) FLOSS. The plot schematically shows the footprint length distribution for coding and non-coding RNAs. See (42) for experimental data. Ribosome footprints are around 30-32 nt long. See Table 2 for a brief description of the program packages implementing each of the computational tools described in Figure 2. The red circle indicates the 5’-cap of RNA; RPF, ribosome protected fragments; ORF, open reading frame.

Figure 3. Peptide identification in the shotgun mass-spectrometry approach. (A) A protein extract is digested with trypsin or other proteases into peptides, which are separated by liquid chromatography and then analyzed by tandem mass spectrometry. (B) Protein sequences in the database are theoretically cleaved into fragments based on the recognition preferences of the protease used in (A). A theoretical MS/MS spectrum is then generated. (C) Comparison of theoretical and experimental spectra allows peptide identification. LCMS, liquid chromatography-mass spectrometry; MS/MS, tandem mass spectrum; R.I., relative intensity.

Figure 4. Overview of RNA molecules encoding short ORFs (sORFs). The red circle indicates the 5’-cap of RNA; ORF, open reading frame; UTR, untranslated region.
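The theoretical cleavage step described in the Figure 3 legend, panel (B), can be illustrated with a minimal sketch. It assumes the commonly used trypsin rule (cleavage after K or R unless the next residue is P), no missed cleavages, and unmodified peptides; the function names `tryptic_digest` and `peptide_mass` are hypothetical and are not taken from any of the tools discussed in this review.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # mass of H2O added to the summed residues


def tryptic_digest(protein):
    """Split a protein sequence after K or R, except when followed by P.

    Assumption: complete digestion with no missed cleavages.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_digest("MKRPAAKG"):
        print(pep, round(peptide_mass(pep), 4))
```

A database search engine would go further and generate fragment ions for each theoretical peptide before comparing them with the experimental spectra, as in panel (C); that step is omitted here.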

Table 1. A list of available sORF prediction software


| Program | Reference | Working principle |
| --- | --- | --- |
| CRITICA (coding region identification tool invoking comparative analysis) | 37 | Analysis of nucleotide sequence composition and conservation at the amino acid level |
| CPC (Coding Potential Calculator) | 128 | Analysis of ORF qualities (ORF size, coverage, integrity) and conservation |
| sORFinder | 129 | Analysis of nucleotide sequence composition and conservation at the amino acid level |
| PhastCons | 26 | Conservation |
| PhyloCSF | 29 | Conservation |
| micPDP | 30 | Quality of the ORF (ORF size, coverage, integrity) and conservation |
| uPEPperoni | 130 | Conservation (only for 5’UTR sORFs) |
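Several of the tools in Table 1 start from simple ORF properties such as ORF size. As a minimal sketch of that first step, the code below scans the three forward frames of a sequence for ORFs of at most 100 codons, the usual sORF cutoff. It assumes an AUG-only start, the standard stop codons, and the forward strand only; the function name `find_sorfs` and its parameters are hypothetical, and none of the listed programs is claimed to work exactly this way.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_aa=100, min_aa=1):
    """Return (frame, start, end, length_aa) for short ORFs in three frames.

    start/end are 0-based, end-exclusive nucleotide coordinates and
    include the stop codon; length_aa excludes the stop codon.
    Assumptions: ATG-only start codons, forward strand only, and the
    first in-frame ATG after the previous stop opens the ORF.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                length_aa = (i - start) // 3
                if min_aa <= length_aa <= max_aa:
                    found.append((frame, start, i + 3, length_aa))
                start = None  # close it and keep scanning
    return found


if __name__ == "__main__":
    print(find_sorfs("CCATGAAATTTTAGGG"))
```

Candidates found this way would then be scored by composition or conservation, which is where the tools in Table 1 differ.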

Table 2. A list of available tools for RIBO-seq data analysis

| Program | Reference | Working principle |
| --- | --- | --- |
| Ribosome Release Score (RRS) | 43 | Quantifies the ratio of the total number of reads inside the coding region to the total number of 3’UTR reads, thus measuring ribosome dissociation at the stop codon of the putative coding region |
| Translated ORF Classifier (TOC) | 39 | Uses four features to assess coding potential: 1. Translation efficiency (ratio of ribosome footprint density over the ORF to its expression derived from RNA-seq). 2. Ratio of the coverage inside the ORF to that outside the ORF; coverage is the number of nucleotides covered by footprints divided by the total number of nucleotides. 3. Fraction length (length of the ORF divided by the total length of the transcript). 4. Disengagement score (number of footprints within the ORF divided by that for the area downstream of the stop codon) |
| Fragment Length Organization Similarity Score (FLOSS) | 42 | Measure of coding potential based on the similarity between the ribosome protected fragment (RPF) length distribution for an ORF and that for known protein-coding genes |
| ORFscore | 30 | Assesses coding potential by quantifying the RPF distribution in each frame, thereby determining the frame in which RPFs are uniformly present |
| ORF Regression Algorithm for Translational Evaluation of RPFs (ORF-RATER) | 131 | Uses linear regression to classify putative ORFs by the distribution of footprints along a coding region as compared with annotated genes in the same dataset. Utilizes data on harringtonine-induced stalling at the start codon |
| RibORF Classifier | 52 | Discriminates between coding and non-coding based on: 1. 3-nt periodicity of ribosome footprints. 2. Uniformity of the footprint distribution across a putative ORF |
| RiboTaper | 53 | Identifies putative coding ORFs using the 3-nt periodicity of footprints determined by Fourier transformation |
| SPECtre | 54 | Identifies putative coding ORFs using the 3-nt periodicity of footprints determined by the average coherence to ideal 3-nt periodicity in a set of sliding windows |
| RiboGalaxy | 55 | An integrated platform for online analysis and visualization of ribosome footprint data |
| Proteoformer | 68 | A platform for integration of ribosome profiling and mass-spectrometry data |
| RiboHMM | 51 | Uses hidden Markov models for analysis of ribosome footprints. Takes into consideration expected relative footprint densities in the CDS, 5’- and 3’-UTRs and in the areas adjacent to the CDS-UTR borders, as well as the distribution of footprints by position within triplets. Uses normalization by RNA-seq |
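The ribosome release score entry in Table 2 can be illustrated with a minimal sketch: footprint signal inside a putative ORF is compared with the signal downstream of its stop codon. The length normalization and the pseudocount used here are assumptions added to keep the toy example well behaved, and the function name `ribosome_release_score` is hypothetical; the published RRS implementation (43) may differ in its normalization.

```python
def ribosome_release_score(coverage, orf_start, orf_end, pseudocount=1.0):
    """Ratio of footprint density inside an ORF to density downstream of it.

    coverage   : per-nucleotide footprint counts along one transcript
    orf_start  : 0-based index of the first nucleotide of the start codon
    orf_end    : 0-based index just past the stop codon (end-exclusive)
    pseudocount: assumption, avoids division by zero for empty 3'UTRs
    """
    orf_len = orf_end - orf_start
    utr_len = len(coverage) - orf_end
    if orf_len <= 0 or utr_len <= 0:
        raise ValueError("ORF and 3'UTR must both have positive length")
    orf_reads = sum(coverage[orf_start:orf_end])
    utr_reads = sum(coverage[orf_end:])
    # Densities are normalized by region length (an assumption of this sketch).
    orf_density = (orf_reads + pseudocount) / orf_len
    utr_density = (utr_reads + pseudocount) / utr_len
    return orf_density / utr_density


if __name__ == "__main__":
    # Toy coding transcript: footprints stop at the stop codon.
    coding = [0] * 10 + [5] * 30 + [0] * 20
    # Toy non-coding transcript: footprints continue past the "stop".
    noncoding = [2] * 60
    print(ribosome_release_score(coding, 10, 40))
    print(ribosome_release_score(noncoding, 10, 40))
```

A high score indicates that ribosomes are released at the stop codon, as expected for a translated ORF, whereas a score near 1 indicates no change in density across the stop codon.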








