Comprehensive profiling of ribo-seq detected small sequences in yeast reveals robust conservation patterns and their potential mechanisms of origin

Abstract

Background

In the budding yeast Saccharomyces cerevisiae, the widespread adoption of ribosome profiling technology has allowed the discovery of evidence of transcription and translation for thousands of small proteins or microproteins whose importance was once disregarded. Both conserved and evolutionarily short-lived microproteins have demonstrated relevant involvement in biological functions. However, sequences exist in a broad spectrum of conservation. Here, we tested whether these small proteins in yeast detected by ribosome profiling technology have different properties across their levels of conservation, and how these properties compare with those of the canonical small protein-coding sequences.

Results

Here, we applied a phylostratigraphic approach to peptides encoded by small open reading frames. We compared 20,023 ribo-seq-detected small peptides against annotated small proteins belonging to reference annotations on the basis of their respective conservation patterns. We identified 1134 unannotated microproteins that, despite their difficulty in being detected by methods other than ribosome profiling, display hallmarks of functionality such as conservation across many taxonomical levels and signals of purifying selection not dissimilar to those of canonical proteins of comparable length. Sequences that initially did not show evidence of belonging to any gene family were found to possess signals of homology traceable mostly at genus level when compared against noncoding regions and using TBLASTN, but also, to a lesser extent, to species belonging to the phyla Basidiomycota and Microsporidia. In addition, we show an analysis of the mutations behind the origin of small open reading frames exclusive to S. cerevisiae and identified changes in the initiation codon as the most common group of mutations when compared to Saccharomyces paradoxus, the closest species to S. cerevisiae.

Conclusions

Our work, by presenting a robust analysis of the extended landscape of small proteins in yeast, suggests that small conserved sequences, either canonical or not, possess a shared evolutionary trajectory, as demonstrated by their properties. These results shed some light on the evolutionary processes behind the extended landscape of small proteins in yeast.

Introduction

Proteins encoded by small open reading frames (sORFs) (i.e. with fewer than 100 continuous codons) are known either as small proteins or as microproteins. Small proteins were initially dismissed because they were considered highly likely to be artifacts that originated just by chance [1,2,3,4]. Nowadays, however, they are recognized as functional elements in several groups, including yeasts [5], mammals [6], fruit flies [7], and plants [8].

In yeast, based on the two broad categories observed by some authors, small proteins belong either to the group of sequences considered canonical or to a second set of less-studied sequences that are part of a so-called transient translatome [9]. In the first group, small proteins possess high levels of translation and their protein products, for the most part, have recognized functions [10]. These canonical small proteins exist in a continuum of conservation. While some can be conserved across large evolutionary distances [11], others are of more recent origin [12]. The study of the evolutionary trajectories and conservation patterns across this continuum has become especially relevant as it overlaps with the study of protein evolution from non-coding sequences (de novo) [13, 14]. It has been found that better conserved sequences tend to be longer, while recently originated ones tend to be shorter [15]. This suggests that proteins tend to increase in length under the constraints imposed by evolution [12, 16]. Similarly, their evolutionary age has been linked to other properties related to sequence composition: Guanine-Cytosine percentage (GC%) has been found to be higher in novel sequences both in primates [17] and yeasts [18]; the isoelectric points of conserved sequences tend to be lower than those of more recent ones [19]; and purifying selection acts more strongly on more conserved sequences [16, 19].

On the other hand, the transient translatome comprises small peptides that were detected using ribosome profiling (ribo-seq) technology. In recent years, the number of these sORFs with evidence of translation has increased by hundreds or even thousands across many species [20,21,22,23,24,25,26,27]. Despite evidence of translation, proving the functionality of these small encoded peptides (SEPs) has remained a challenging issue [14]. Standard experimental approaches and current proteomics technologies struggle to detect peptides with low translation levels [24, 28]. This, unfortunately, is the case for many of those SEPs. For this reason, their biological significance has been strongly contested in recent years [29]. Some authors have gone as far as to consider them dubious products of “translational noise” [30,31,32].

In 2023, a study in yeast experimentally identified that many of these small sequences uniquely found by ribo-seq, despite having very poor conservation, are involved in functions related to DNA repair and stress response [9]. The following year, the same team, by applying co-expression network techniques, found evidence suggesting cellular roles for thousands of these SEPs [33]. These two works join previous efforts in flies [34] and humans [35] that have compiled evidence supporting that SEPs detected by ribo-seq can be functional.

Although evidence of biological relevance for these less explored small sequences is increasing, it remains to be seen how their evolutionary trajectories and conservation patterns compare to those presented by the better-studied canonical proteins. Similarities in conservation patterns with annotated proteins could suggest further biological importance, while differences, on the other hand, would support the notion that these pervasively translated sequences do not play biological roles. To explore this issue, we compared the evolutionary trajectories and conservation patterns of well-annotated small sequences against ribo-seq detected sequences. To represent well-annotated proteins in yeast we collected all the proteins of 100 amino acids or less from the NCBI reference genome R64 of the strain S288C uploaded by the Saccharomyces Genome Database [36], which presents the most current sequenced version [37], and from Scannell et al., 2011, which offers an alternative high-quality reference catalog of proteins for yeast and yeast species comparison [38]. To represent the transient translatome we utilized SEPs with evidence of translation produced by ribo-seq technology listed in the SmProt database [23]. We applied a phylostratigraphic approach similar to those described in previous phylogenetic studies [39, 40], aligning S. cerevisiae protein sequences to the proteomes of a diverse selection of 17 fungal species both closely and distantly related to S. cerevisiae to get an estimated evolutionary age for each of the small proteins. We found that a small portion (less than 10%) of the ribo-seq detected proteins could be assigned to a phylostratigraphic age. During this step, sequences were assigned to conservation levels according to the most distant organism with which they shared orthology. For example, if a sequence had orthology with Candida glabrata and with Saccharomyces mikatae, the sequence would be assigned to level 6 (Fig. 1). Several patterns across the conservation levels of these sequences, such as purifying selection, amino acid usage, and length, were comparable to those observed in sequences annotated for reference use. Subsequently, functional annotation revealed that these genes are potentially involved in a diverse array of functions, whereas less conserved sequences are enriched with disordered regions. Next, we analyzed sequences that were not found to belong to a gene family and were deemed to be taxonomically restricted genes. Less than 1% of the sequences without straightforward homology could be traced back unambiguously to noncoding regions of Saccharomyces paradoxus, the closest species to S. cerevisiae, whereas thousands had homology signals against large coding regions in other sequences. Conservation signals of many of these sequences could be identified at genus level, and at phylum level for several others. Furthermore, we examined the sequences that matched against S. paradoxus to identify possible mutations that could explain the origin of the starting and termination codons behind these young sORFs. Overall, our analysis produces new insights into the properties and evolutionary trajectories of small proteins, both canonical and transient, present in the S. cerevisiae genome and shows a comprehensive record of homology signals by comparing them against a diverse set of fungal species.

Fig. 1
Fungi species showing their respective conservation levels according to the taxonomic distance to S. cerevisiae

Results

Regardless of their source, sequences follow a similar conservation distribution

After filtering for protein sequences with lengths less than or equal to 100 amino acids, we obtained a total of 336 sequences from NCBI, 613 from Scannell, and 20,023 from SmProt. We confirmed that SmProt-derived sequences did not contain SEPs that partially overlap with annotated coding sequences or that lack evidence of translation initiation. We checked whether there were differences in the proteins from the three groups and found that, on average, proteins from SmProt tended to be smaller than proteins from the other two groups (ANOVA, p < 2e-16) (Fig. 2A).

Fig. 2
A Average size of sequences in terms of the number of amino acids by source. B Sequences labeled according to the grouping into orthologous groups. Sequences belonging to an orthologous group were considered matched. C Size of conserved SEPs in terms of the number of amino acids by source. D Isoelectric points of conserved SEPs by source. E Nonsynonymous-to-synonymous substitution ratio (Ka/Ks) of conserved SEPs by source with a value less than 1. F GC content of conserved SEPs by source. Significance is denoted as * P < 0.05

After using ProteinOrtho [41] to group the sequences from each source into orthologous groups with sequences from the additional 17 species, 18,378 (91%) of the proteins from SmProt remained unmatched, whereas smaller fractions from NCBI and Scannell (28% and 36%, respectively) were not found to belong to orthologous groups (Fig. 2B). We found that the matched NCBI proteins were, on average, 73.65 amino acids in length, whereas the unmatched proteins were smaller, with 54.88 amino acids in length on average (t test, p < 2e-10) (Supplementary Fig. S1). In Scannell, the length of the matched proteins was, on average, 68.48 amino acids, whereas the length of the unmatched proteins was smaller, at 44.70 amino acids on average (t test, p < 2e-16) (Supplementary Fig. S1). Among the 20,023 ribo-seq-identified SEPs, the 1645 sequences that could be assigned to orthologous groups were later reduced to 1134 when considering redundancy. These matched sequences from SmProt were, on average, 80.11 amino acids in length, whereas the unmatched proteins were smaller, at 21.33 amino acids in length (t test, p < 2e-16) (Supplementary Fig. S1).

On average, conserved SEPs from SmProt were larger than those found in NCBI and Scannell (ANOVA, p < 2e-16) (Fig. 2C). On average, SmProt SEPs had lower isoelectric points than did sequences from the other two sources (ANOVA, p < 2e-16) (Fig. 2D). After calculating their Ka/Ks ratios, we found 7 sequences in NCBI with evidence of diversifying selection (i.e., values greater than 1), 16 in Scannell and 15 in SmProt. When the remaining sequences with values lower than 1 were compared, SmProt SEPs were not under less evolutionary pressure than sequences from the other sources were (Fig. 2E) (ANOVA, p > 0.05). None of the groups presented differences in their GC content (Fig. 2F) (ANOVA, p > 0.05).

We were interested in studying these matched sequences according to the conservation level determined by the most distant species that shared at least one sequence in the orthologous group. Discounting conservation level 0, in NCBI the most numerous conservation levels were 4 and 13 (z test, p < 0.0001). In Scannell, levels 1 and 4 were the highest (z test, p < 0.0001), followed by level 13 (z test, p < 0.001). In SmProt, 8 was significantly higher than the average (z test, p < 0.0001), followed by level 8. Conservation levels 5 and 6 were significantly lower in all cases (z test, p < 0.0001) (Fig. 3 and supplementary Table S1).
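The labeling rule described above (each sequence takes the level of the most distant species present in its orthologous group) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name and the species-to-level table are hypothetical, and only the level of Candida glabrata (6) is taken from the example given in the Introduction.

```python
# Hypothetical species-to-level table (the real assignment is shown in Fig. 1).
# Only the value for Candida glabrata follows from the example in the text;
# the other two values are assumed for illustration.
SPECIES_LEVEL = {
    "Saccharomyces paradoxus": 1,  # assumed
    "Saccharomyces mikatae": 2,    # assumed
    "Candida glabrata": 6,         # from the example in the text
}


def conservation_level(ortholog_species, species_level=SPECIES_LEVEL):
    """Return the level of the most distant species in an orthologous group.

    Level 0 is returned when the group contains no species other than
    S. cerevisiae, i.e. the sequence remained unmatched.
    """
    levels = [species_level[s] for s in ortholog_species if s in species_level]
    return max(levels, default=0)


# A sequence with orthologs in both species is labeled with the higher level.
print(conservation_level(["Candida glabrata", "Saccharomyces mikatae"]))
```

Taking the maximum rather than, for example, the most frequent level means that a single distant ortholog is enough to place a sequence at a high conservation level.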

Fig. 3
Sequences of the three sources labeled by conservation level according to the farthest related species in the same orthologous group

To observe any potential pattern across the conservation levels, we examined the following properties: the average length, the strength of purifying selection, the isoelectric point, the frequency of each amino acid, the number of homologs (i.e., proteins in the same orthologous group belonging to S. cerevisiae), how many proteins belong to each orthologous group, the number of large sequences present in each orthologous group, the average length of proteins in each orthologous group, the difference between the length of the small proteins and the average length of large proteins in the same orthologous group and the proportion represented by large sequences in each orthologous group. We found that more conserved sequences were larger (NCBI, Mann‒Kendall test p = 0.03; Scannell, Mann‒Kendall test p = 0.002; SmProt, Mann‒Kendall test p < 0.001) (supplementary Fig. S2). Sequences with lower conservation had higher Ka/Ks ratios (NCBI, Mann‒Kendall test p < 0.001; Scannell, Mann‒Kendall test p < 0.001; SmProt, Mann‒Kendall test p = 0.001) (Fig. 4A). Those that are better conserved tended to have more members in each orthologous group (NCBI, Mann‒Kendall test p < 0.001; Scannell, Mann‒Kendall test p < 0.001; SmProt, Mann‒Kendall test p < 0.001) (supplementary Fig. S3). Groups at higher conservation levels tended to contain more large proteins than those at lower conservation levels did (NCBI, Mann‒Kendall test p < 0.001; Scannell, Mann‒Kendall test p < 0.001; SmProt, Mann‒Kendall test p < 0.001) (supplementary Fig. S4). We found that for the three sources, the average length of orthologous proteins increased with conservation level (NCBI, Mann‒Kendall test, p < 0.05; Scannell, Mann‒Kendall test, p < 0.05; SmProt, Mann‒Kendall test, p < 0.001) (supplementary Fig. S5); however, the average length of large homologs only increased with conservation level for SmProt (NCBI, Mann‒Kendall test, p = 0.11; Scannell, Mann‒Kendall test, p = 0.04; SmProt, Mann‒Kendall test, p < 0.001) (supplementary Fig. S6). The other variables presented no significant trends (supplementary Fig. S7–S10).
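For readers unfamiliar with the Mann‒Kendall test used throughout this section, the following sketch shows how a monotonic trend across ordered conservation levels can be tested. It is a textbook version written for illustration, assuming no tied values (no tie correction is applied) and a normal approximation for the p value; the function name is hypothetical and this is not necessarily the implementation used to obtain the reported results.

```python
import math


def mann_kendall(values):
    """Two-sided Mann-Kendall trend test for an ordered series.

    Returns (tau, z, p). Assumes no ties among the values, so the
    variance of S is n(n-1)(2n+5)/18 without a tie correction.
    """
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")

    # S counts concordant minus discordant pairs (i < j).
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)

    tau = s / (n * (n - 1) / 2)
    var_s = n * (n - 1) * (2 * n + 5) / 18

    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0

    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p value
    return tau, z, p


# Illustrative input: one summary value per conservation level, in order.
tau, z, p = mann_kendall([52, 55, 58, 61, 63, 67, 70, 74, 79, 85])
print(tau, p)
```

A positive tau indicates that the property increases with conservation level and a negative tau that it decreases, which is how the tau statistics in Fig. 4B can be read.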

Fig. 4
A Nonsynonymous-to-synonymous substitution rate ratio by conservation level in SmProt SEPs. B Tau statistic of the Mann‒Kendall test for monotonic trends showing the frequency of each amino acid by conservation level. Significance is denoted as * P < 0.05. C Functional enrichment at the conservation level in NCBI. D Functional enrichment by conservation level in Scannell. E Functional enrichment by conservation level in SmProt

The frequency of glycine use was significantly greater in sequences with higher conservation levels for the three sources (NCBI, Mann‒Kendall test, p = 0.0162; Scannell, Mann‒Kendall test, p = 0.0375; SmProt, Mann‒Kendall test, p = 0.00302), as was that of aspartic acid (NCBI, Mann‒Kendall test, p = 0.0166; Scannell, Mann‒Kendall test, p = 0.00149; SmProt, Mann‒Kendall test, p = 0.00441). The frequency of lysine residues was found to be greater in SmProt sequences with greater conservation (Mann‒Kendall test p < 0.005), whereas serine and leucine residues were found to be more frequent in SmProt sequences with lower conservation levels (Mann‒Kendall test p < 0.0001 for serine, Mann‒Kendall test p = 0.028 for leucine) (Fig. 4B).

Finally, sequences from the three sources were annotated with potential functions on the basis of alignment of their sequences. We assigned them functions based on the multiple database comparison offered by InterProScan V5.47 [42] and annotated them with gene ontology labels with eggNOG Mapper V5.02 [43]. NCBI small proteins at conservation level 14 and Scannell small proteins at conservation levels 1, 2 and 14 presented significant functional enrichment (Fig. 4C-D). We found that SmProt SEPs with conservation levels of 2, 3, 4, 7, 8, 10, 11, 12, 13 and 14 were enriched for some of the gene ontology labels obtained by eggNOG Mapper (supplementary Fig. S11), whereas those with conservation levels of 3, 4, 5, 6, 8, 13 and 14 were functionally enriched based on the labels obtained by InterProScan (Fig. 4E). Less conserved SEPs were enriched in intrinsically disordered regions, whereas better conserved SEPs were enriched in mitochondrial gene expression functions, diverse metabolic processes and histone-related functions (Fig. 4E).

Nonconserved SEPs homology signals

To further investigate the nature of the unmatched SEPs and search for subtle signals of conservation, we performed extensive homology searches against the entire genomes of our 17 study species. Our analysis revealed varying degrees of detectable homology across the different datasets against coding sequences (Table 1). Using TBLASTN and BLASTN we found that a notable fraction of the initial unmatched sequences from NCBI (17 out of 95), Scannell (73 out of 224), and SmProt (6692 out of 18,378) did, in fact, match known coding sequences. Moreover, since the origins of many orphan proteins or lineage-specific sequences (LSSs) can be traced back by observing the intergenic sequences of other species [44, 45], we extended the homology search against the non-coding regions of the genomes. As before, by applying TBLASTN and BLASTN, our analysis revealed that a portion of these sequences could be linked to non-coding origins (Table 1). We found matches for 9 NCBI sequences, 17 Scannell sequences, and a substantial 910 SmProt sequences. By observing the results against coding and non-coding regions, only one sequence in NCBI matched a non-coding region but no coding regions. This number was 0 for Scannell and 84 for SmProt (Fig. 5).

Fig. 5
Summary of the classification of small proteins of each source according to their possible origin. a Proteins from NCBI. b Proteins from Scannell. c Proteins from SmProt. The matched sequences correspond to those with homology with sequences in the other 17 species. Unknown sequences correspond to sequences with no matches to any other sequences. Noncoding sequences correspond to sequences that match only noncoding sequences in other species. Large sequences correspond to sequences that matched large sequences in other species. Only small sequences correspond to sequences that matched small sequences in other species

Table 1 Homology signals of nonconserved SEPs against other species of fungi

As 18%, 33%, and 36% of the sequences in NCBI, Scannell and SmProt that were originally unmatched in the previous section were found to have positive matches against the coding regions, we wanted to observe the size of the subject sequences to learn whether these SEPs showed orthology to large or small sequences in other species. All the subject sequences with a significant hit were classified as either large or small depending on their size. The threshold was kept at 300 nucleotides. For Scannell, most of the subject sequences were large, as 367 were 300 nucleotides in length or more, and 33 were small. The same was observed in SmProt, where 13,105 were large and 2173 were small. Only for NCBI were most of the subject sequences small: 452 were small, and 62 were large. On average, the length of subject sequences matching the NCBI sequences (135 nt) was smaller than the average length of those matching Scannell (3116 nt) and SmProt (1451 nt) sequences (ANOVA, p < 2e-16).
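The size classification of subject sequences can be expressed in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the only element taken from the text is the threshold, with subjects of 300 nucleotides or more counted as large.

```python
LARGE_THRESHOLD_NT = 300  # threshold stated in the text


def classify_subject(length_nt, threshold=LARGE_THRESHOLD_NT):
    """Label a subject sequence as 'large' (>= threshold) or 'small'."""
    return "large" if length_nt >= threshold else "small"


def count_by_size(lengths_nt):
    """Count how many subject sequences fall into each size class."""
    counts = {"large": 0, "small": 0}
    for length in lengths_nt:
        counts[classify_subject(length)] += 1
    return counts


# Illustrative lengths in nucleotides (not data from the study).
print(count_by_size([135, 299, 300, 1451, 3116]))
```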

On the basis of the sequences aligned with at least one of the previous methods, 77 sequences from NCBI remained unmatched, as did 151 sequences from Scannell and 11,602 from SmProt. These numbers represent, respectively, 23%, 29% and 58% of the sequences (Fig. 5). These sequences were subsequently matched against the nonredundant database from NCBI to find homologies that were not previously accounted for given our limited selection of species. SmProt had the most significant matches, 98, whereas Scannell had 42, and NCBI had 43.

Once the sequences were classified according to their homology signals, we were interested in analyzing their distribution on the basis of their length. Since many small sequences could result from the accumulation of random mutations, we created a set of randomly generated sequences on the basis of the GC content of the yeast noncoding regions. None of the sets of sequences from SmProt was similar to the set of randomly generated sequences (Supplementary Fig. S12, Supplementary Table S2).
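One simple way to build such a random background is to draw each nucleotide independently with probabilities set by a target GC fraction. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the GC fraction of 0.35 is a placeholder (the value used for yeast noncoding regions is not given in the text), and independence between positions is a simplification.

```python
import random


def random_sequence(length, gc_fraction, rng):
    """Draw a random nucleotide sequence with the given expected GC fraction.

    Positions are independent; G and C share gc_fraction equally, and
    A and T share the remainder equally.
    """
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def gc_content(sequence):
    """Fraction of G and C nucleotides in a sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


rng = random.Random(0)  # fixed seed so the example is reproducible
background = [random_sequence(300, 0.35, rng) for _ in range(1000)]
print(sum(gc_content(s) for s in background) / len(background))
```

The lengths of open reading frames found in such a background can then be compared with the lengths of the observed sequences.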

To further analyze the conservation of these sequences, we assigned them to a conservation level in the same manner as for the protein-against-protein alignment. The conservation level corresponding to the most taxonomically distant organism in which a successful match was found by either TBLASTN or BLASTN was retained. Because of this, each matched protein had four possible labels: one if it matched a coding sequence using BLASTN, one if it matched a coding sequence using TBLASTN, one if it matched a non-coding sequence using BLASTN and one if it matched a non-coding sequence using TBLASTN. Against coding sequences using BLASTN, for NCBI most sequences had conservation levels 2 and 4 (Supplementary Fig. S13A). For Scannell, most had conservation level 3, and for SmProt, most had conservation level 1, followed by conservation level 2 (Supplementary Fig. S13A). Using TBLASTN, conservation level 4 had the most matches for NCBI and SmProt; in the case of Scannell, level had the most matches (Supplementary Fig. S13B). For the three sources, matches were more conserved using TBLASTN (Supplementary Fig. S13B). Against non-coding sequences using BLASTN, level 1 had the most matches for the three sources (Supplementary Fig. S13C). Using TBLASTN, NCBI had the most matches at conservation level 4, Scannell had the most matches at conservation level 1 and SmProt had the most matches at conservation level 8. Again, higher conservation levels were reached using TBLASTN (Supplementary Fig. S13D).

ORF triggers explain small proteins present in S. cerevisiae but missing in S. paradoxus

Genes can be generated step by step from an accumulation of mutations that add the necessary attributes to establish a functional ORF. The sequence of enabling mutations, such as a starting codon or an in-frame termination codon, can be described by aligning a sequence to a species at a different level of taxonomic distance [46]. As we wanted to explore the possible mutations that explain these genes and how they are present in S. cerevisiae but absent in other species, we used MACSE to improve the alignment. The query sequence was grouped together with all the regions it was aligned with, encompassing both coding sequences and intergenic regions. This produced lists of closely related sequences that allowed comparisons of the differences between them and the S. cerevisiae coding sequences. Figure 6a and b serve as examples of the triggering mutations that generated putatively new sORFs. The sequence from SmProt SPROSCE4770 had a hit against an intergenic region from S. paradoxus. The intergenic region in S. paradoxus is missing the final nucleotide to complete the stop codon (Fig. 6a). The sequence from SmProt SPROSCE4719, on the other hand, also had a match with an intergenic region in S. paradoxus. This time, the first adenine needed to complete the initiation codon is missing (Fig. 6b). Since S. paradoxus is the closest species to S. cerevisiae, we analyzed the alignment of the small proteins against S. paradoxus. We took only the best significant hit and annotated whether the subject sequences belonged to coding or noncoding regions of the S. paradoxus genome. Similarly, we also annotated whether the alignment was performed via TBLASTN or BLASTN (Supplementary Table S3). We counted how many sequences with a significant hit against the S. paradoxus genome had, as the first codon, a triplet distinct from the initiation codon found in the S. cerevisiae sequence. This same procedure was performed for the termination codon. We found that most SEPs from SmProt did not share the same starting codon with their homologous sequences in S. paradoxus, with only 1.5% of the sequences aligned by TBLASTN against coding regions having the same starting codon. Among the sequences aligned via BLASTN, 2.8% had the same starting codon. This proportion was greater, at 34%, for sequences aligned against noncoding regions. In contrast, in alignments against coding sequences, more than half of the sequences, 54% for sequences aligned via TBLASTN and 70.2% for sequences aligned via BLASTN, shared the same termination codon. A total of 35–44% of the sequences that were aligned against noncoding regions of S. paradoxus had the same termination codon. Among the proteins that did not share the same starting codon, 76.8% of the pairs of sequences aligned via BLASTN against coding regions and 68.3% of the sequences aligned via TBLASTN against coding regions had, at the first position, a shift in the reading frame. For sequences aligned against noncoding regions via BLASTN, 54.6% had a frameshift, whereas for sequences aligned against noncoding regions via TBLASTN, 45.2% had a frameshift.
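The comparison of boundary codons can be illustrated with a small sketch. It assumes that the S. cerevisiae ORF and its S. paradoxus subject are supplied as two aligned nucleotide strings of equal length, with "-" marking gaps, and that the query is a complete ORF without gaps at its ends. The function name and the toy sequences are hypothetical, and flagging a gap inside a boundary codon is only a rough proxy for the frameshift annotation described above.

```python
def compare_boundary_codons(query_aln, subject_aln):
    """Compare the first and last codons of two aligned nucleotide strings.

    query_aln   : aligned S. cerevisiae ORF (assumed gap-free at both ends)
    subject_aln : aligned S. paradoxus region, '-' marks alignment gaps
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")

    subject_start = subject_aln[:3]
    subject_stop = subject_aln[-3:]
    return {
        "same_start": query_aln[:3] == subject_start,
        "same_stop": query_aln[-3:] == subject_stop,
        # A gap inside a boundary codon suggests a missing nucleotide.
        "gap_in_start": "-" in subject_start,
        "gap_in_stop": "-" in subject_stop,
    }


# Toy case: the subject lacks the first adenine of the initiation codon.
print(compare_boundary_codons("ATGAAATGA", "-TGAAATGA"))
```

Applied to every best hit, the fractions of pairs with `same_start` or `same_stop` set would correspond to the kind of percentages reported in this section.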

Fig. 6
A Sequence SPROSCE4770 with positive match against an intergenic sequence from S. paradoxus. The intergenic sequence had a different start codon. B Sequence SPROSCE4719 with positive match against an intergenic sequence from S. paradoxus. The intergenic sequence had a different termination codon. C Frequencies of the 15 most common codons in the subject sequences of S. paradoxus matched against the query sequences from S. cerevisiae

For the total number of subject sequences in S. paradoxus, we inspected which starting and termination codons were most common. With respect to noncoding sequences, via BLASTN, TGA was the most common start codon, followed by ATG. Using TBLASTN, ATG was the most common start codon, followed by ATA. For BLASTN, TTG and TGA were the most common termination codons, and TTG and AAT ranked first for results obtained with TBLASTN (Fig. 6c). For alignments against coding regions, TGA and TGG ranked first for the start codon, regardless of the alignment method used. For the termination codon, in BLASTN, TGA ranked first, followed by TAA, and in TBLASTN, the results were similar, with TAA first, followed by TGA (Fig. 6c). All of the most common codons for each category were enriched relative to the expected frequency under a uniform distribution (one proportion z test, p < 0.005).
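The enrichment test mentioned above can be sketched as a one-sided, one-proportion z test. The version below is illustrative: it assumes that the uniform expectation is 1/64 (one of 64 possible triplets), which is an interpretation rather than something stated in the text, and the function name and example counts are hypothetical.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """One-sided z test of whether an observed proportion exceeds p0.

    Returns (z, p_value), where p_value is the upper-tail probability
    under the normal approximation.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: a codon seen 50 times among 640 subject sequences,
# tested against an assumed uniform expectation of 1/64.
z, p = one_proportion_z_test(50, 640, 1 / 64)
print(z, p)
```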

Discussion

In recent years, it has become evident that sequences translated in eukaryotic genomes vastly outnumber the canonical annotated coding genes. Most of these are less than 100 amino acids in length [47]. Compared to larger sequences, proteins of these lengths are rather poorly studied. The reasons for this lack of attention go back to the first sequencing studies in yeast when, due to technical limitations of the technology available in the early 1990s, it was agreed that the inclusion and analysis of these sequences was impractical since sequences of this size have a high probability of being artifactual ORFs produced just by chance, and because they greatly outnumber other sequences [2,3,4].

Even among the non-conserved subset of small proteins, some perform biological roles which have been experimentally verified [9]. However, it is not clear whether the conservation and properties of these small and hard-to-detect proteins are consistent with the characteristics displayed by small canonical protein-coding genes or, conversely, are fundamentally distinct. In this work, we compared 20,023 sequences putatively encoding small proteins against the entire set of canonical small proteins from two different reference annotations of the yeast genome. A total of 1645 of these ribo-seq-detected sequences demonstrated high conservation across the fungal species we utilized. They had no differences in GC composition when compared with annotated sequences (Fig. 2F) and presented evidence of purifying selection (Fig. 2E).

Inspecting the conservation patterns produced by the phylostratigraphic procedures, we found that some sequences detected by ribo-seq were conserved in fairly distant species, such as C. gattii or E. intestinalis (Fig. 3), which belong to the phyla Basidiomycota and Microsporidia, respectively, and not to Ascomycota, to which S. cerevisiae belongs. The divergence between Ascomycota and Basidiomycota was estimated to have occurred between approximately 1808 million and 400 million years ago [48], indicating that small sequences detected only by ribo-seq can be conserved across extremely large spans of time, further suggesting significant biological roles.

The patterns corresponding to each of the sources had similarities, with a significant fraction of them being conserved at the genus level, level 8, and level 13 (Fig. 3). However, interestingly, we observed that a larger percentage of the SEPs from SmProt were conserved at these levels, compared to the canonical small proteins from NCBI and Scannell (Fig. 3). We believe that a possible explanation for this is the positive relationship between the size of homologous sequences in other species and the conservation level observed in SmProt but not in NCBI and Scannell (Supplemental Fig. 4). While a larger fraction of the SEPs match distant large proteins, canonical small proteins seem to be closely related to sequences of similar size which are of more recent origin. A more detailed analysis of the relationship between these SEPs and the large proteins homologous to them would be necessary to clarify this phenomenon.

While examining similarities in their sequence composition, we found no differences among the three sources in terms of the GC content. Since previous studies have linked high-GC regions to the formation of novel genes [18], we could have expected differences if the sequences were less conserved. Similarly, when we analyzed the ratio between nonsynonymous and synonymous mutations, in search of evidence of purifying selection, we found that most SmProt sequences had a ratio below 1, which was not dissimilar to sequences from NCBI or Scannell. The sequences from the three sources presented a significant monotonic trend in which, as expected, the sequences with higher conservation levels presented lower Ka/Ks ratios (Fig. 4). The sequences that were found only to be shared between S. cerevisiae and S. paradoxus presented relatively high Ka/Ks ratios, as they are, evolutionarily speaking, the youngest conserved sequences. The enriched functions of the sequences differed relative to their estimated conservation level. In agreement with previous research [49], newer sequences were rich in intrinsically disordered regions and were related to carbohydrate metabolic processes (Supplemental Fig. S11).

Regarding their differences, SmProt sequences had, on average, a lower isoelectric point. Previously, some studies have shown that conserved sequences have lower isoelectric points than novel sequences do, which are depleted of acidic residues in yeast and flies [19]. In accordance with this observation, when we measured the amino acid composition, one of the two amino acids that showed a clear relationship with the conservation level across the three sets was aspartic acid. Among sequences derived from ribosome profiling only, less conserved sequences had significantly high contents of serine. The same pattern was observed in NCBI and Scannell; however, in those datasets, the trend was not statistically significant (Fig. 4B).

At the beginning of the yeast genome sequencing project, approximately one-third of the genes presented no similarity to genes of other related organisms [50]. Although poor sequencing of related species is an important factor to consider as an explanation for the lack of homology, within less than two decades of those initial efforts, improved annotation and sequencing revealed homology for genes previously considered to be orphans. At the same time, these improvements uncovered more genes with no orthology. This can be observed in our data, since in both the NCBI and the Scannell data, approximately one-third of the sequences could not be matched. However, TBLASTN can detect related sequences that the usual BLASTN cannot find [51]. Furthermore, given that proteins can evolve out of noncoding regions [52, 53], we explored their relationships with both the coding regions and the intergenic regions of the 17 additional species, expecting to find similar sequences. By translating the genomic sequences in all six reading frames, we were able to find similarities that cannot be detected in protein-to-protein comparisons. Our analysis revealed that, via TBLASTN, 17 sequences from NCBI, 73 from Scannell, and 6692 from SmProt could be aligned, sometimes against fairly distant organisms at classification level 14 or 13 for Scannell and SmProt (Supplementary Fig. S13B). We suspect that this strong conservation may indicate that these proteins possess important functionality.

Many of the sequences in SmProt match large sequences from other species. Similar to previous results, a significant fraction of nonconserved ORFs overlap with canonical sequences [9]. These results seem to indicate that one source of small proteins is the appearance of a new start codon, usually in an alternative reading frame, owing to a missense mutation nested within a preexisting large ORF. This mechanism has been previously reported as an important route for the origin of sequences in bacteria [54]. Notably, small proteins that partially or totally overlap with other ORFs have high rates of detectable homology, higher than those of other small protein biotypes [55]. Further corroborating these observations, when we compared the best matches against S. paradoxus, we observed that the most common difference was the starting codon, which is usually absent in S. paradoxus.

Regarding the changes required to obtain a starting codon or a termination codon, subject sequences from both TBLASTN and BLASTN tended to start with TGA, TGG, TGT and TGC (Fig. 6). Since frameshifts in the starting codon were ubiquitous, we suspect that the most common mutation to obtain the starting codon ATG is the insertion of an adenine in the first position. After the TGX codons, the second most common codon was AGA, which requires the insertion of a thymine in the second position. In the case of alignments against the noncoding sequences, we found that a different starting codon was less common, as was later confirmed by inspecting the frequency of the starting codons. For BLASTN alignments, TGA ranked first, followed closely by ATG. For TBLASTN, ATG ranked first, with ATA second, suggesting that a change from adenine to guanine in the third position is a common mutation to obtain the ORF. TGG and TGA also ranked high. Here, again, the insertion of an adenine in the first position is a suitable explanation.
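The reasoning about which single mutation converts an observed codon into ATG can be checked mechanically. The following is a minimal sketch, not part of the original pipeline; the function name edits_to_atg and its output wording are hypothetical. It lists every single substitution or single-base insertion that makes the first three bases read ATG.

```python
def edits_to_atg(codon, target="ATG"):
    """List single substitutions or insertions that turn `codon` into `target`.

    Illustrative helper (hypothetical name). Positions are 1-based.
    For insertions, only the first three bases after the insertion are compared.
    """
    codon = codon.upper()
    found = []
    # Single-base substitutions at each of the three positions.
    for pos in range(3):
        if codon[pos] != target[pos]:
            if codon[:pos] + target[pos] + codon[pos + 1:] == target:
                found.append(f"substitution {codon[pos]}>{target[pos]} at position {pos + 1}")
    # Single-base insertions before position 1, 2 or 3.
    for pos in range(3):
        for base in "ACGT":
            if (codon[:pos] + base + codon[pos:])[:3] == target:
                found.append(f"insertion of {base} at position {pos + 1}")
    return found


for observed in ("TGA", "AGA", "ATA"):
    print(observed, edits_to_atg(observed))
```

Under these assumptions, TGA is reached by inserting an adenine in the first position, AGA by inserting a thymine in the second position, and ATA by an adenine-to-guanine change in the third position, which matches the explanations given above.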

In the case of the most common termination codons, for alignments against coding sequences, since different termination codons were not common, unsurprisingly, TAA ranked very high, either second or first. The other common codon was TGA, suggesting that the common mutation GC-to-AT [56] is responsible for a large part of the new termination codons. For alignments against intergenic regions, where different termination codons were more prevalent, TTG ranked first for both alignments. CTA, ATA, GTA and TTA are also not uncommon, which accounts for the frameshifts required for the termination codon TAA.

One limitation of our study is that both annotations chosen to represent well-annotated sequences are based on the genome of the same strain, S288C [36, 38]. This reference genome was selected to offer a consensus representative for the species, and it is standard procedure to compare newly sequenced genomes of S. cerevisiae against it [37]. However, S. cerevisiae is a complex species with substantial genetic diversity across its strains, which vary among wild populations and those used for industrial, clinical or laboratory purposes [57], as well as by geography [58]. Since S288C lacks genes characterized in other strains or differs in the number of sequences belonging to a gene family [37], it would be reasonable to assume that the non-canonical translatome could present important differences across strains. Further analysis of the conservation patterns and differences in the non-canonical SEPs could generate insights into the variability of the species, the evolution of these sequences and their functional relevance by linking them to the ecological roles and conditions in which each strain is found.

Finally, our methods left 77, 161 and 11 602 sequences without identified homology to any of the other species included in this study for NCBI, Scannell and SmProt, respectively. Among these sequences, 43, 42 and 98, respectively, matched sequences present in the nonredundant database from NCBI. Many of the matched sequences corresponded to genera of bacteria, but fish and mammals were also included. These potential homologies among very distantly related organisms could represent genes that were not inherited from a last common ancestor but were acquired by horizontal transfer [59], or they could result from genome contamination, which has been previously identified in the nonredundant database from NCBI [60]. Nonetheless, confirming those origins requires a more robust analysis that is out of the scope of this paper. Those sequences that did not match anything at all could very well be cataloged as LSS.

Conclusions

In this work, we constructed a profile for SEPs belonging to the baker’s yeast genome, most of which were sequences discovered via ribosome profiling. Using a combination of phylostratigraphic methods, we found that the conserved ribosome profiling sequences showed conservation patterns similar to those of the canonical small proteins. These 1134 sequences are the main candidates to have levels of functionality akin to those of canonical proteins, as they also presented similar lengths and similar GC contents and had recognizable domains that were used for functional annotation. We found that nonconserved SEPs are significantly shorter than conserved SEPs and that a subset of even shorter sequences does not have known homology. With respect to the origin of these sequences, many have homology signals that are hard to detect, as it appears that they are expressed in alternative reading frames, caused by insertions, which create new open reading frames not found in closely related species. Many of these homologies were found only by applying TBLASTN. Additionally, 84 of these SmProt sequences were found to be closely related only to noncoding sequences, highlighting that noncoding sequences can give rise to proteins, usually of very reduced length. Our work supports the emerging framework that small proteins represent a larger fraction of the functional proteome of yeast and potentially of many more organisms. Sequences that were once considered irrelevant because of an arbitrary cutoff are now a recognized part of the genome. Likewise, sequences that are hard to detect by proteomic methods have been disregarded as nonfunctional, but their conservation patterns, which are very similar to those of canonical protein-coding sequences and sometimes span different phyla, suggest otherwise.

Materials and methods

Data collection

We analyzed small proteins from three sets. The first one is the NCBI reference R64 genome of yeast. We also included the set of genome annotations of Saccharomyces sensu stricto by Scannell et al. 2011 because this resource was made available with the intention of facilitating studies of yeast gene evolution [38]. Finally, we included the data from the second version of SmProt, since this database specializes in collecting small proteins from model organisms and is the largest to date [22, 23]. From each source, every sequence with a length over 100 amino acids (aa) was discarded. In SmProt, three sources are cited for the construction of the yeast database: literature mining, known databases and ribosome profiling. Sequences labeled as literature mining and known databases were removed.

For the alignment and phylostratigraphy steps, we chose 18 species of fungi that represent organisms at different evolutionary distances from S. cerevisiae. The species were selected on the basis of those used in previous phylostratigraphic studies concerning the conservation and evolution of de novo proteins in yeast [12, 61, 62]. We included the sister species S. cerevisiae and S. paradoxus; Agaricus bisporus and Cryptococcus gattii, which belong to the phylum Basidiomycota; and Encephalitozoon intestinalis, which belongs to the phylum Microsporidia (Fig. 1). From NCBI, we downloaded the GFF, protein sequences, genomic sequences, and coding sequences for the following species of fungi: S. cerevisiae, Candida glabrata, Debaryomyces hansenii, Kluyveromyces lactis, Ashbya gossypii, Naumovozyma castellii, Tetrapisispora phaffii, Yarrowia lipolytica, Aspergillus nidulans, Neurospora crassa, Schizosaccharomyces pombe, Agaricus bisporus, Cryptococcus gattii, and E. intestinalis (NCBI accession codes in Supplemental Table S4). From Scannell et al. 2011, we downloaded the genomic sequences, protein sequences, GFFs and intergenic regions for S. cerevisiae, S. paradoxus, S. mikatae IFO 1815T, S. kudriavzevii IFO 1802 and S. bayanus var. uvarum CBS 7001 (S. uvarum).

Alignment procedure

For the first alignment step, ProteinOrtho from the Galaxy Europe Server [41, 63] was used to classify yeast protein sequences from each source into orthologous groups. ProteinOrtho also includes paralogs in orthologous groups [41]. The e value threshold was set at 0.0001, the minimal coverage of the best blast alignment percentage was set to 50, and the minimal percent identity of the best blast hits percentage was set to 25. Every protein sequence from the 17 additional species was included regardless of its size. After alignment, the proteins from S. cerevisiae that belonged to an orthologous group were classified as matched. Those with no groups were classified as unmatched. Differences in length between matched and unmatched sequences were analyzed via Student’s t test. Differences in the distribution of length were analyzed via the Wilcoxon rank sum test. Statistical analyses were performed in R statistical software v4.3.1 [64]. Additional visualization was performed with ggstatsplot [65].

Age classification of sequences matched in orthologous groups

For the matched sequences, we followed a phylostratigraphic method that has been used in a wide range of research to approximate the evolutionary age of each sequence according to the phylogenetic distance of the species that share orthologous sequences [12, 45, 66,67,68]. A conservation level was assigned to each of the species according to its distance from S. cerevisiae on the basis of the current understanding of the phylogenetics of the fungal kingdom [12, 69,70,71,72]. The species tree was constructed from the protein sequences of all the species involved using Orthofinder 2.5.5 [73] with default parameters. The adequacy of the resulting tree was assessed by comparing it with the taxonomy and phylogeny presented in the current literature [61, 72, 74]. If the distance was zero (i.e., unmatched sequences or sequences of S. cerevisiae that only matched with sequences of S. cerevisiae), the age was 0, whereas if the sequences shared orthology with the most distant organism we used, in this case E. intestinalis, an age of 14 was assigned. Each protein sequence matched via ProteinOrtho was classified by age according to the most distant species for which a significant alignment hit was obtained. For each source, we grouped the sequences by their age classification. The differences in small proteins per phylostratigraphic level against the average were calculated via a z test.
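The age rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' script: the function name assign_age and the dictionary LEVELS are hypothetical, and only the four species-to-level anchors mentioned in the text are filled in (a real table would list every species used).

```python
# Hypothetical lookup of conservation levels. Only these four values are
# stated in the text; the remaining species would need to be added.
LEVELS = {
    "S. cerevisiae": 0,
    "S. paradoxus": 1,
    "S. mikatae": 2,
    "E. intestinalis": 14,
}


def assign_age(hit_species, levels=LEVELS):
    """Return the level of the most distant species with a significant hit.

    Sequences with no hits, or with hits only in S. cerevisiae, get age 0.
    """
    return max((levels[species] for species in hit_species), default=0)


print(assign_age([]))                                   # unmatched sequence
print(assign_age(["S. paradoxus", "S. mikatae"]))       # genus-level match
print(assign_age(["S. paradoxus", "E. intestinalis"]))  # most distant match
```

Taking the maximum level reflects the phylostratigraphic convention that a sequence is at least as old as the most distant lineage in which a homolog is detected.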

Afterwards, we analyzed several metrics, including the following: (1) the average length; (2) the number of homologs in S. cerevisiae; (3) the number of large proteins (more than 100 aa in length) that were orthologous to small proteins; (4) the total number of proteins with orthology for each small protein; (5) the average of each group; (6) the average amino acid length of the proteins within each orthologous family; and (7) the average amino acid length of the large proteins within each orthologous family. The Mann‒Kendall test was used to find significant trends, with the conservation level treated as an ordinal variable. The redundancy of sequences was identified via CD-HIT with the parameters -c 0.95 and 0.9 and -g 1 [75]. Multiple sequence alignments of proteins were converted into their corresponding codon sequences via PAL2NAL [76]. Additionally, the functions computePI and kaks from the seqinr package [77] in R were used to predict the isoelectric point of all the small proteins and the Ka and Ks values of all the pairs of homologous proteins, respectively.
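For readers unfamiliar with the Mann‒Kendall test, the sketch below shows the statistic for a series ordered by conservation level. It is written in Python for illustration only; the analyses in this paper were performed in R. The function name mann_kendall is hypothetical, and the p value relies on the usual normal approximation with a correction for tied values.

```python
import math
from itertools import combinations


def mann_kendall(values):
    """Mann-Kendall trend test on a series ordered by conservation level.

    Returns (S, z, two-sided p). Illustrative implementation using the
    normal approximation with tie correction.
    """
    n = len(values)
    # S counts concordant minus discordant pairs (later value vs earlier value).
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(values, 2))
    # Variance of S, corrected for groups of tied values.
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in counts.values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p


# Made-up example: a metric that decreases as the conservation level increases.
print(mann_kendall([0.9, 0.7, 0.6, 0.4, 0.2]))
```

A negative S with a small p value indicates a monotonic decrease along the ordered levels, which is the kind of trend reported for Ka/Ks.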

Domain and function annotation was performed via sequence comparison with InterProScan V5.47 [42]. Gene ontology annotation was performed with eggNOG Mapper V5.02 [43] with a scoring matrix BLOSUM62, a minimum e-value threshold of 0.001 and the use of only nonelectronic curated terms. Enrichment analysis for both functional and GO terms was performed with the R package clusterProfiler, with p values adjusted via the Benjamini‒Hochberg procedure [78].
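The Benjamini‒Hochberg adjustment mentioned above can be illustrated as follows. This is a standalone sketch with a hypothetical function name; in the study the adjustment was applied within the R enrichment workflow.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p values for four enrichment terms.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.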

Alignment of unmatched sequences against coding and intergenic regions

We used the protein sequences and their corresponding coding nucleotide sequences from the proteins that were not matched during the ProteinOrtho step. TBLASTN was used to align these protein sequences against the coding and intergenic regions of the additional species. Subsequently, BLASTN was used to align the corresponding nucleotide sequences against the coding and intergenic databases. An E value of 0.0001 was used for each of the alignments. We applied the phylostratigraphic approach again to label every sequence according to the most distantly related organism matched.
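TBLASTN compares a protein query against a nucleotide database translated in all six reading frames. The sketch below illustrates only that translation step, assuming the standard genetic code; it is not a reimplementation of TBLASTN, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(dna):
    """Return the translations of the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


print(six_frame_translation("ATGAAATAG"))
```

Because every frame on both strands is translated, a protein query can align to a region that is not annotated as coding or that is read in a different frame in the subject species.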

Inference of potential ORF triggers

Any type of indel or substitution that can create a new ORF from an ancestral region is called a trigger or enabler [79]. To identify these enablers, we adapted part of the pipeline used in a previous study by Li Zhang et al. (2019). Using MACSE, with the parameters -fs 100.0 -fs_lr 20.0 -stop 100.0 -stop_lr 10.0, we aligned the clusters of orthologous sequences. MACSE alignment represents frameshifts using “!” and identifies premature stop codons with “*”. On the basis of this alignment, we labeled them according to the missing features or differences from the S. cerevisiae sequence. The labels included different start codons, different termination codons and frameshifts in the start codons or frameshifts in the termination codons.
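The labeling step can be sketched as follows. This is a simplified illustration, not the pipeline adapted from the previous study: the function name label_triggers is hypothetical, the inputs are assumed to be two rows of a codon-level alignment of equal length, and a codon is called "different" whenever it does not equal the S. cerevisiae codon at the same position.

```python
def codons(seq):
    """Split an aligned nucleotide row into consecutive codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def label_triggers(ref_aln, other_aln):
    """Label differences at the start and termination codons of the reference.

    ref_aln is the aligned S. cerevisiae row and other_aln the aligned
    ortholog row. "!" marks a frameshift and "-" a gap, as in MACSE output.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned rows must have the same length")
    ref = codons(ref_aln.upper())
    other = codons(other_aln.upper())
    # Reference codons without gaps or frameshift marks.
    coding = [i for i, c in enumerate(ref) if "-" not in c and "!" not in c]
    if not coding:
        return []
    first, last = coding[0], coding[-1]
    labels = []
    if "!" in other[first]:
        labels.append("frameshift in start codon")
    elif other[first] != ref[first]:
        labels.append("different start codon")
    if "!" in other[last]:
        labels.append("frameshift in termination codon")
    elif other[last] != ref[last]:
        labels.append("different termination codon")
    return labels


print(label_triggers("ATGAAATAA", "!TGAAATAA"))
print(label_triggers("ATGAAATAA", "ATAAAATTG"))
```

In this sketch a frameshift takes precedence over a codon difference at the same position, which is a design choice made for the example and is not stated in the text.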

To analyze sequences according to their most recent changes and to identify the closest organisms in which noncoding regions share orthology with coding sequences, we also annotated the sequences with the lowest conservation level. Thus, if a sequence matches both S. paradoxus and S. mikatae, which represent conservation levels 1 and 2, respectively, the lowest conservation level is 1 (Fig. 1). For the comparison between S. paradoxus and S. cerevisiae, if a sequence had multiple matches against S. paradoxus, whether against coding or noncoding regions, only the best hit was selected.

Sequences that failed to present any significant hit with any of the BLAST databases were subsequently aligned to the nonredundant database from NCBI to recognize possible homologies with either closely or distantly related species. Any match against sequences from S. cerevisiae was removed. Sequences that had no match whatsoever were considered to have an unknown origin.

Simulation of de novo ORFs by chance

We extracted the intergenic sequences from the S. cerevisiae reference genome. The average length of these intergenic sequences was 8057.42 base pairs. Using our own script, we measured the nucleotide composition of each of the sequences. The result was A:T:C:G = 0.31:0.31:0.19:0.19. We generated 1 000 000 random sequences with the same average length and the same A:T:C:G content. Using the AUGUSTUS algorithm [80], we predicted ORFs in these sequences. Additionally, we generated 1 000 000 sequences with random GC content and the same average length. Kolmogorov‒Smirnov tests were used to compare the length distributions of the predicted ORFs against those of the small proteins from our three sources. The p values were adjusted via the Bonferroni correction method.
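The random-sequence step can be sketched as below. The base frequencies and sequence length are taken from the text, but everything else is an assumption made for illustration: the function names are hypothetical, and a plain ATG-to-stop scan on the forward strand is used in place of AUGUSTUS, so the resulting ORF lengths are not comparable to the published ones.

```python
import random

FREQS = {"A": 0.31, "T": 0.31, "C": 0.19, "G": 0.19}  # composition from the text
STOPS = {"TAA", "TAG", "TGA"}


def random_sequence(length, freqs, rng):
    """Draw a random sequence with the given base frequencies."""
    bases = list(freqs)
    weights = [freqs[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def orf_lengths(seq):
    """Lengths in codons (stop excluded) of ATG-to-stop ORFs, forward strand.

    Simple stand-in for a gene predictor: in each frame, an ORF opens at the
    first ATG and closes at the next in-frame stop codon.
    """
    lengths = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                lengths.append((i - start) // 3)
                start = None
    return lengths


rng = random.Random(0)  # fixed seed so the example is reproducible
sequence = random_sequence(8057, FREQS, rng)
lengths = orf_lengths(sequence)
print(len(lengths), max(lengths, default=0))
```

The simulated length distribution could then be compared with the observed small-protein lengths using a two-sample Kolmogorov‒Smirnov test, as described above.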

Data availability

NCBI data are available at https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000146045.2/. The SmProt data can be downloaded from http://bigdata.ibp.ac.cn/SmProt/download.htm. Data from Scannell et al. 2011 are available at http://sss.genetics.wisc.edu/cgi-bin/s3.cgi.

References

  1. Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA Jr, Hieter P, Kinzler KW. Characterization of the Yeast Transcriptome. Cell. 1997;88:243–51.

  2. Dinger ME, Pang KC, Mercer TR, Mattick JS. Differentiating protein-coding and noncoding RNA: challenges and ambiguities. PLoS Comput Biol. 2008. https://doi.org/10.1371/journal.pcbi.1000176.


  3. Guerra-Almeida D, Nunes-da-Fonseca R. Small open reading frames: how important are they for molecular evolution? Front Genet. 2020;11:1–6. https://doi.org/10.3389/fgene.2020.574737.


  4. Basrai MA, Hieter P, Boeke JD. Small open reading frames: beautiful needles in the haystack. Genome Res. 1997;7:768–71.


  5. Kastenmayer JP, Ni L, Chu A, Kitchen LE, Au W, Yang H, et al. Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Res. 2006;16:365–73. https://doi.org/10.1101/gr.4355406.7.


  6. Frith MC, Forrest AR, Nourbakhsh E, Pang KC, Kai C, Kawai J, et al. The abundance of short proteins in the mammalian proteome. PLoS Genet. 2006;2:515–28. https://doi.org/10.1371/journal.pgen.0020052.


  7. Kondo T, Plaza S, Zanet J, Benrabah E, Valenti P, Hashimoto Y, et al. Small peptides switch the transcriptional activity of Shavenbaby during Drosophila embryogenesis. Science. 2010;329:336–9. https://doi.org/10.1126/science.1188158.


  8. Hanada K, Higuchi-Takeuchi M, Okamoto M, Yoshizumi T, Shimizu M, Nakaminami K, et al. Small open reading frames associated with morphogenesis are hidden in plant genomes. Proc Natl Acad Sci U S A. 2013;110:2395–400. https://doi.org/10.1073/pnas.1213958110.


  9. Wacholder A, Parikh SB, Coelho NC, Acar O, Houghton C, Chou L, et al. A vast evolutionarily transient translatome contributes to phenotype and fitness. Cell Syst. 2023;14:363–381.e8. https://doi.org/10.1016/j.cels.2023.04.002.


  10. Couso J, Patraquim P. Classification and function of small open reading frames. Nat Rev Mol Cell Biol. 2017;18:575–89. https://doi.org/10.1038/nrm.2017.58.


  11. Kastenmayer JP, Ni L, Chu A, Kitchen LE, Au WC, Yang H, et al. Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Res. 2006;16:365–73. https://doi.org/10.1101/gr.4355406.


  12. Carvunis A, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Hidalgo A, et al. Proto-genes and de novo gene birth. Nature. 2012;487:3–7. https://doi.org/10.1038/nature11184.


  13. Baena-Angulo C, Platero AI, Couso JP. Cis to trans: small ORF functions emerging through evolution. Trends Genet. 2024;41:119–31. https://doi.org/10.1016/j.tig.2024.10.012.


  14. Parikh SB, Houghton C, Van Oss SB, Carvunis AR. Origins, evolution, and physiological implications of de novo genes in yeast. Yeast Extr 2022:471–81. https://doi.org/10.1002/yea.3810

  15. Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA. The relationship of protein conservation and sequence length. BMC Evol Biol. 2002;10:1–10.


  16. Jin G, Ma PF, Wu X, Gu L, Long M, Zhang C, et al. New genes interacted with recent whole-genome duplicates in the fast stem growth of bamboos. Mol Biol Evol. 2021;38:5752–68. https://doi.org/10.1093/molbev/msab288.


  17. Dowling D, Schmitz JF, Bornberg-Bauer E. Stochastic gain and loss of novel transcribed open reading frames in the human lineage. Genome Biol Evol. 2020;12:2183–95. https://doi.org/10.1093/gbe/evaa194.


  18. Vakirlis N, Hebert AS, Opulente DA, Achaz G, Hittinger CT, Fischer G, et al. A molecular portrait of de novo genes in yeasts. Mol Biol Evol. 2017;35:631–45. https://doi.org/10.1093/molbev/msx315.


  19. Montañés JC, Huertas M, Messeguer X, Albà MM. Evolutionary trajectories of new duplicated and putative de novo genes. Mol Biol Evol. 2023;40:1–16. https://doi.org/10.1093/molbev/msad098.


  20. Aspden JL, Eyre-Walker YC, Phillips RJ, Amin U, Mumtaz MAS, Brocard M et al. Extensive translation of small open reading frames revealed by Poly-Ribo-Seq. Elife 2014:1–19. https://doi.org/10.7554/eLife.03528

  21. Verbruggen S, Verhegen K, Olexiouk V, Crapp J, Martens L, Menschaert G. sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res. 2016;44:324–9. https://doi.org/10.1093/nar/gkv1175.


  22. Hao Y, Zhang L, Niu Y, Cai T, Luo J, He S, et al. SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Brief Bioinform. 2018;19:636–43. https://doi.org/10.1093/bib/bbx005.


  23. Li Y, Zhou H, Chen X, Zheng Y, Kang Q, Hao D, et al. SmProt: a reliable repository with comprehensive annotation of small proteins identified from ribosome profiling. Genomics Proteomics Bioinformatics. 2021;19:602–10. https://doi.org/10.1016/j.gpb.2021.09.002.


  24. Ahrens CH, Wade JT, Champion MM, Langer JD. A practical guide to small protein discovery and characterization using mass spectrometry. J Bacteriol. 2022. https://doi.org/10.1128/jb.00353-21.


  25. Vazquez-Laslop N, Sharma CM, Mankin A, Buskirk AR. Identifying small open reading frames in prokaryotes with ribosome profiling. J Bacteriol. 2022. https://doi.org/10.1128/JB.00294-21.


  26. Mudge JM, Ruiz-Orera J, Prensner JR, Brunet MA, Calvet F, Jungreis I, et al. Standardized annotation of translated open reading frames. Nat Biotechnol. 2022;40:994–9. https://doi.org/10.1038/s41587-022-01369-0.


  27. Wright BW, Yi Z, Weissman JS, Chen J. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. 2022;32(3):243–58. https://doi.org/10.1016/j.tcb.2021.10.010.


  28. Patraquim P, Mumtaz MAS, Pueyo JI, Aspden JL, Couso JP. Developmental regulation of canonical and small ORF translation from mRNAs. Genome Biol. 2020;21:1–26. https://doi.org/10.1186/s13059-020-02011-5.


  29. Wacholder A, Carvunis AR. Biological factors and statistical limitations prevent detection of most noncanonical proteins by mass spectrometry. PLoS Biol. 2023;21:1–27. https://doi.org/10.1371/journal.pbio.3002409.


  30. Struhl K. Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat Struct Mol Biol. 2007;14:103–5. https://doi.org/10.1038/nsmb0207-103.


  31. Ponjavic J, Ponting CP, Lunter G. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs. Genome Res. 2007;17:556–65. https://doi.org/10.1101/gr.6036807.


  32. Robinson R. Dark matter transcripts: sound and fury. Signifying nothing? PLoS Biol. 2010;8:e1000370. https://doi.org/10.1371/journal.pbio.1000370.


  33. Rich A, Acar O, Carvunis AR. Massively integrated coexpression analysis reveals transcriptional regulation, evolution and cellular implications of the yeast noncanonical translatome. Genome Biol. 2024;25:1–28. https://doi.org/10.1186/s13059-024-03287-7.


  34. Patraquim P, Magny EG, Pueyo JI, Platero AI, Couso JP. Translation and natural selection of micropeptides from long non-canonical RNAs. Nat Commun. 2022. https://doi.org/10.1038/s41467-022-34094-y.


  35. Chen J, Brunner AD, Cogan JZ, Nuñez JK, Fields AP, Adamson B, et al. Pervasive functional translation of noncanonical human open reading frames. Science. 2020;367:140–6. https://doi.org/10.1126/science.aav5912.


  36. Engel SR, Aleksander S, Nash RS, Wong ED, Weng S, Miyasato SR, et al. Saccharomyces genome database: advances in genome annotation, expanded biochemical pathways, and other key enhancements. Genetics. 2024;229:1–7. https://doi.org/10.1093/genetics/iyae185.


  37. Engel SR, Dietrich FS, Fisk DG, Binkley G, Balakrishnan R, Costanzo MC, et al. The reference genome sequence of Saccharomyces cerevisiae: then and now. G3 Genes Genomes Genet. 2014;4(3):389–98. https://doi.org/10.1534/g3.113.008995.


  38. Scannell DR, Zill OA, Rokas A, Payen C, Dunham MJ, Eisen MB, et al. The awesome power of yeast evolutionary genetics: new genome sequences and strain resources for the Saccharomyces sensu stricto genus. G3: Genes|Genomes|Genetics. 2011;1(1):11–25. https://doi.org/10.1534/g3.111.000273.


  39. Domazet-Lošo T, Tautz D. Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biol. 2010;8:1–10. https://doi.org/10.1186/1741-7007-8-66.


  40. Tautz D, Domazet-Lošo T. The evolutionary origin of orphan genes. Nat Rev Genet. 2011. https://doi.org/10.1038/nrg3053.


  41. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (Co-) orthologs in large-scale analysis. BMC Bioinformatics. 2011;12:124. https://doi.org/10.1186/1471-2105-12-124.

  42. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics. 2014;30:1236–40. https://doi.org/10.1093/bioinformatics/btu031.


  43. Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, et al. EggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019;47:D309–14. https://doi.org/10.1093/nar/gky1085.


  44. Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG. More than just orphans: are taxonomically-restricted genes important in evolution? Trends Genet. 2009;25:404–13. https://doi.org/10.1016/j.tig.2009.07.006.


  45. Domazet-Lošo T, Tautz D. A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns. Nature. 2010;468:815–9. https://doi.org/10.1038/nature09632.


  46. Zhang L, Ren Y, Yang T, Li G, Chen J, Gschwend AR, et al. Rapid evolution of protein diversity by de novo origination in Oryza. Nat Ecol Evol. 2019;3:679–90. https://doi.org/10.1038/s41559-019-0822-5.


  47. Papadopoulos C, Arbes H, Chevrollier N, Blanchet S, Cornu D, Roginski P, et al. The ribosome profiling landscape of yeast reveals a high diversity in pervasive translation. Genome Biol. 2023;2023(0316532990). https://doi.org/10.1186/s13059-024-03403-7.

  48. Taylor JW, Berbee ML. Dating divergences in the fungal tree of life: review and new analyses. Mycologia. 2006;98:838–49. https://doi.org/10.3852/mycologia.98.6.838.


  49. Mackowiak SD, Zauber H, Bielow C, Thiel D, Kutz K, Calviello L et al. Extensive identification and analysis of conserved small ORFs in animals. Genome Biol 2015:1–21. https://doi.org/10.1186/s13059-015-0742-x

  50. Dujon B. The yeast genome project: what did we learn? Trends Genet. 1996;12:263–70. https://doi.org/10.1016/0168-9525(96)10027-5.


  51. Palmieri N, Kosiol C, Schlötterer C. The life cycle of Drosophila orphan genes. Elife. 2014;3:1–21. https://doi.org/10.7554/elife.01311.


  52. Cai J, Zhao R, Jiang H, Wang W. De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics. 2008;179:487–96. https://doi.org/10.1534/genetics.107.084491.


  53. Ruiz-Orera J, Hernandez-Rodriguez J, Chiva C, Sabidó E, Kondova I, Bontrop R, et al. Origins of de novo genes in human and chimpanzee. PLoS Genet. 2015;11:1–24. https://doi.org/10.1371/journal.pgen.1005721.


  54. Gray T, Storz G, Papenfort K. Small proteins; big questions. J Bacteriol. 2022. https://doi.org/10.1128/JB.00341-21.


  55. Sandmann CL, Schulz JF, Ruiz-Orera J, Kirchner M, Ziehm M, Adami E et al. Evolutionary origins and interactomes of human, young microproteins and small peptides translated from short open reading frames. Mol Cell. 2023;83:994–1011. https://doi.org/10.1016/j.molcel.2023.01.023

  56. Liu H, Zhang J. Yeast spontaneous mutation rate and spectrum vary with environment. Curr Biol. 2019;29:1584–91.e3. https://doi.org/10.1016/j.cub.2019.03.054.


  57. Kang K, Bergdahl B, MacHado D, Dato L, Han TL, Li J, et al. Linking genetic, metabolic, and phenotypic diversity among Saccharomyces cerevisiae strains using multi-omics associations. Gigascience. 2019;8:1–14. https://doi.org/10.1093/gigascience/giz015.


  58. Wang QM, Liu WQ, Liti G, Wang SA, Bai FY. Surprisingly diverged populations of Saccharomyces cerevisiae in natural environments remote from human activity. Mol Ecol. 2012;21:5404–17. https://doi.org/10.1111/j.1365-294X.2012.05732.x.


  59. Wissler L, Gadau J, Simola DF, Helmkampf M, Bornberg-Bauer E. Mechanisms and dynamics of orphan gene emergence in insect genomes. Genome Biol Evol. 2013;5:439–45. https://doi.org/10.1093/gbe/evt009.

  60. Steinegger M, Salzberg SL. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol. 2020;21:1–12. https://doi.org/10.1186/s13059-020-02023-1.

  61. Ekman D, Elofsson A. Identifying and quantifying orphan protein sequences in fungi. J Mol Biol. 2010;396:396–405. https://doi.org/10.1016/j.jmb.2009.11.053.

  62. Lu TC, Leu JY, Lin WC. A comprehensive analysis of transcript-supported de novo genes in Saccharomyces sensu stricto yeasts. Mol Biol Evol. 2017;34:2823–38. https://doi.org/10.1093/molbev/msx210.

  63. The Galaxy Community. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update. Nucleic Acids Res. 2022;50:W395–402.

  64. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.

  65. Patil I. Visualizations with statistical details: the ggstatsplot approach. J Open Source Softw. 2021;6(61):3167. https://doi.org/10.21105/joss.03167.

  66. Toll-Riera M, Bosch N, Bellora N, Castelo R, Armengol L, Estivill X, et al. Origin of primate orphan genes: A comparative genomics approach. Mol Biol Evol. 2009;26:603–12. https://doi.org/10.1093/molbev/msn281.

  67. Zhang L, Tan Y, Fan S, Zhang X, Zhang Z. Phylostratigraphic analysis of gene co-expression network reveals the evolution of functional modules for ovarian cancer. Sci Rep. 2019;9:1–12. https://doi.org/10.1038/s41598-019-40023-9.

  68. Sogabe S, Hatleberg WL, Kocot KM, Say TE, Stoupin D, Roper KE, et al. Pluripotency and the origin of animal multicellularity. Nature. 2019;570:519–22. https://doi.org/10.1038/s41586-019-1290-4.

  69. Sulo P, Szaboova D, Bielik P, Polakova S, Soltys K, Jatzova K, et al. The evolutionary history of Saccharomyces species inferred from completed mitochondrial genomes and revision in the ‘yeast mitochondrial genetic code’. DNA Res. 2017;24:571–83. https://doi.org/10.1093/dnares/dsx026.

  70. Scannell DR, Butler G, Wolfe KH. Yeast genome evolution — the origin of the species. Yeast. 2008;191–8. https://doi.org/10.1002/yea.

  71. Kurtzman CP, Robnett CJ. Phylogenetic relationships among yeasts of the ‘Saccharomyces complex’ determined from multigene sequence analyses. FEMS Yeast Res. 2003;3:417–32. https://doi.org/10.1016/S1567-1356(03)00012-6.

  72. Marcet-Houben M, Gabaldón T. Beyond the whole-genome duplication: phylogenetic evidence for an ancient interspecies hybridization in the baker’s yeast lineage. PLoS Biol. 2015;13:1–26. https://doi.org/10.1371/journal.pbio.1002220.

  73. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:1–14. https://doi.org/10.1186/s13059-015-0721-2.

  74. Alsammar H, Delneri D. An update on the diversity, ecology and biogeography of the Saccharomyces genus. FEMS Yeast Res. 2020;20:1–12. https://doi.org/10.1093/femsyr/foaa013.

  75. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9. https://doi.org/10.1093/bioinformatics/btl158.

  76. Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34:609–12. https://doi.org/10.1093/nar/gkl315.

  77. Charif D, Lobry JR. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Struct Approaches Seq Evol. 2007;207–32. https://doi.org/10.1007/978-3-540-35306-5_10.

  78. Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16:284–7. https://doi.org/10.1089/omi.2011.0118.

  79. Knowles DG, McLysaght A. Recent de novo origin of human protein-coding genes. Genome Res. 2009;19:1752–9. https://doi.org/10.1101/gr.095026.109.

  80. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19:ii215–25. https://doi.org/10.1093/bioinformatics/btg1080.

Acknowledgements

We thank Professor Chuan Xu and Ernest Liu for their comments and fruitful discussion points, which helped improve this manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (32170664, 31871329 and 42327805), the Key Project for Computational Biology of Shanghai (23JS1400800) and the Fundamental Research Funds for the Central Universities (YG2023ZD11). The computations in this paper were run on the Siyuan-1 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University.

Author information

Contributions

CRL, XQZ and JL: conceptualization and design of the study. CRL: analysis of data and design and preparation of figures. CRL and WL: statistical analysis. CRL: manuscript drafting. CRL, WL, XQZ and JL: review and editing of the manuscript.

Corresponding author

Correspondence to Jing Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Reyes Loaiciga, C., Li, W., Zhao, XQ. et al. Comprehensive profiling of ribo-seq detected small sequences in yeast reveals robust conservation patterns and their potential mechanisms of origin. BMC Genomics 26, 856 (2025). https://doi.org/10.1186/s12864-025-12064-0

  • DOI: https://doi.org/10.1186/s12864-025-12064-0

Keywords