Int J Mol Sci. 2025 Sep 27;26(19):9477.
doi: 10.3390/ijms26199477.

CG-Based Stratification of 8-mers Highlights Functional Roles and Phylogenetic Divergence Markers

Guojun Liu et al. Int J Mol Sci.

Abstract

K-mer analysis is a powerful tool for understanding genome structure and evolution. A "k-mer" is a short DNA sequence of k nucleotides (where k is a fixed integer), and an "m-mer" is the same kind of sequence with a shorter length m. The functional mechanisms of CG-containing k-mers, and their potential role in evolutionary processes, remain unclear. To explore this issue, we analyzed 8-mers in six species with varying genomic complexity and evolutionary divergence: Homo sapiens, Saccharomyces cerevisiae, Bombyx mori, Ciona intestinalis, Danio rerio, and Caenorhabditis elegans. The 8-mers were grouped by the number of CG dinucleotides they contain (0CG, 1CG, and 2CG). We examined the relative frequencies of shorter m-mers (m = 3 and 4) within each CG-defined group, using information-theoretic, distance-based, and angular metrics. Our results show that 0CG motifs follow random patterns, while 1CG and 2CG motifs deviate significantly from them, likely because of functional constraints such as nucleosome binding and CpG island association. The observed unimodal distribution of 8-mers arises from the convergence of the three CG-defined groups. Among them, the 2CG group shows the highest divergence in m-mer composition, followed by 1CG, reflecting varying degrees of selective pressure. Furthermore, species-specific differences in CG-classified 8-mer patterns could provide valuable insights into phylogenetic relationships. Through extensive comparison, we explore how CG content and sequence composition influence genomic organization and contribute to evolutionary divergence across taxa. These findings deepen our understanding of short motif functions, genome organization, and sequence evolution.
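The stratification described in the abstract can be illustrated with a minimal Python sketch. It counts 8-mers, groups them by number of CG dinucleotides, and compares the 3-mer composition of each group with that of all 8-mers. The abstract does not define the paper's metrics, so Kullback-Leibler divergence, Euclidean distance, and the angle between frequency vectors are used here only as generic stand-ins for the information-theoretic, distance-based, and angular families. All function names are hypothetical, the input is a random toy sequence rather than a genome, and leaving 8-mers with more than two CGs out of the groups is an assumption of this sketch.

```python
from collections import Counter
from itertools import product
import math
import random

BASES = set("ACGT")

def stratify_8mers(seq, k=8):
    """Count all k-mers, and split the counts by number of CG dinucleotides.

    Assumption: k-mers with more than two CGs are counted in `overall`
    but not assigned to any group.
    """
    overall = Counter()
    groups = {0: Counter(), 1: Counter(), 2: Counter()}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if not set(kmer) <= BASES:  # skip windows with N or other symbols
            continue
        overall[kmer] += 1
        n_cg = kmer.count("CG")  # CG occurrences cannot overlap
        if n_cg in groups:
            groups[n_cg][kmer] += 1
    return overall, groups

def mmer_frequencies(kmer_counts, m):
    """Relative frequency of every m-mer inside a count-weighted k-mer set."""
    counts = Counter()
    for kmer, weight in kmer_counts.items():
        for i in range(len(kmer) - m + 1):
            counts[kmer[i:i + m]] += weight
    total = sum(counts.values())
    words = ("".join(p) for p in product("ACGT", repeat=m))
    return {w: (counts[w] / total if total else 0.0) for w in words}

def kl_divergence(p, q, eps=1e-12):
    """Stand-in information-theoretic metric: D(p || q) in nats."""
    return sum(p[w] * math.log((p[w] + eps) / (q[w] + eps))
               for w in p if p[w] > 0)

def euclidean(p, q):
    """Stand-in distance-based metric."""
    return math.sqrt(sum((p[w] - q[w]) ** 2 for w in p))

def angle(p, q):
    """Stand-in angular metric: angle in radians between frequency vectors."""
    dot = sum(p[w] * q[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    if norm == 0:
        return float("nan")
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Toy demonstration on a random sequence (not real genomic data).
rng = random.Random(0)
toy_seq = "".join(rng.choice("ACGT") for _ in range(20000))
overall, groups = stratify_8mers(toy_seq)
ref = mmer_frequencies(overall, 3)
for n_cg, counts in groups.items():
    freq = mmer_frequencies(counts, 3)
    print(f"{n_cg}CG: KL={kl_divergence(freq, ref):.4f} "
          f"dist={euclidean(freq, ref):.4f} angle={angle(freq, ref):.4f}")
```

On a real genome the same functions would be applied per chromosome or per species; the choice of reference distribution (all 8-mers here) follows the "overall 8-mer" wording of the figure captions.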

Keywords: CG dinucleotide; information-theoretic analysis; k-mer distribution; sequence evolution.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

Figures

Figure 1
Distribution patterns of 8-mer frequencies in human chromosome 1 and analysis based on dinucleotide composition. (A) The distribution of 8-mers in human chromosome 1, with the x axis representing the number of k-mer appearances and the y axis indicating frequency of appearance (FA). (B) The same distribution shown with a log-transformed x axis. (C) A comparison with a random sequence of matched length and CG content shows that Peak3 corresponds to random 8-mer usage. (D) The number of 8-mers in the 0XY, 1XY, and 2XY subsets. (E) Distributions of 8-mers containing different numbers of CG dinucleotides. (F) Distributions of 8-mers containing different numbers of GC dinucleotides. (G) The distribution of 8-mers in chromosome 1 after removing CpG island sequences. (H) The same distribution as (G), shown with a log-transformed x axis.
Figure 2
The 8-mer distribution in yeast genome sequences. (A) The sequence lengths of the sixteen chromosomes in yeast. (B) The unimodal distribution of 8-mers in the yeast genome sequence, with the x axis representing the number of k-mer appearances. (C) The unimodal distribution of 8-mers in the yeast genome sequence, with the x axis representing the logarithmic scale of the number of k-mer appearances.
Figure 3
Distribution of 8-mers containing different counts (0, 1, 2) of the 16 dinucleotide types. (A) Distribution of 8-mers containing 0, 1, or 2 instances of CG, GC, CC, and GG dinucleotides. (B) Distribution of 8-mers containing 0, 1, or 2 instances of AA, AT, TA, and TT dinucleotides. (C) Distribution of 8-mers containing 0, 1, or 2 instances of each of the following dinucleotides: AC, AG, CA, CT, GA, GT, TC, and TG.
Figure 4
Usage divergence of 3-mers and 4-mers between the 0CG, 0GC, 0TC, and 0AA subsets and the overall 8-mer. (A) The usage divergence of 3-mers between the 0CG subset and the overall 8-mer. (B) The usage divergence of 3-mers between the 0GC subset and the overall 8-mer. (C) The usage divergence of 3-mers between the 0TC subset and the overall 8-mer. (D) The usage divergence of 3-mers between the 0AA subset and the overall 8-mer. (E) The usage divergence of 4-mers between the 0CG subset and the overall 8-mer. (F) The usage divergence of 4-mers between the 0GC subset and the overall 8-mer. (G) The usage divergence of 4-mers between the 0TC subset and the overall 8-mer. (H) The usage divergence of 4-mers between the 0AA subset and the overall 8-mer.
Figure 5
Usage divergence of 3-mers and 4-mers between the 1CG, 1GC, 1TC, and 1AA subsets and the overall 8-mer. (A) The usage divergence of 3-mers between the 1CG subset and the overall 8-mer. (B) The usage divergence of 3-mers between the 1GC subset and the overall 8-mer. (C) The usage divergence of 3-mers between the 1TC subset and the overall 8-mer. (D) The usage divergence of 3-mers between the 1AA subset and the overall 8-mer. (E) The usage divergence of 4-mers between the 1CG subset and the overall 8-mer. (F) The usage divergence of 4-mers between the 1GC subset and the overall 8-mer. (G) The usage divergence of 4-mers between the 1TC subset and the overall 8-mer. (H) The usage divergence of 4-mers between the 1AA subset and the overall 8-mer.
Figure 6
Usage divergence of 3-mers and 4-mers between the 2CG, 2GC, 2TC, and 2AA subsets and the overall 8-mer. (A) The usage divergence of 3-mers between the 2CG subset and the overall 8-mer. (B) The usage divergence of 3-mers between the 2GC subset and the overall 8-mer. (C) The usage divergence of 3-mers between the 2TC subset and the overall 8-mer. (D) The usage divergence of 3-mers between the 2AA subset and the overall 8-mer. (E) The usage divergence of 4-mers between the 2CG subset and the overall 8-mer. (F) The usage divergence of 4-mers between the 2GC subset and the overall 8-mer. (G) The usage divergence of 4-mers between the 2TC subset and the overall 8-mer. (H) The usage divergence of 4-mers between the 2AA subset and the overall 8-mer.
Figure 7
Analysis of the usage divergence of 3-mers and 4-mers between the 0XY, 1XY, and 2XY subsets and the overall 8-mer based on NSRE. (A) Analysis of the usage divergence of 3-mers between the 0XY subset and the overall 8-mer based on NSRE. (B) Analysis of the usage divergence of 4-mers between the 0XY subset and the overall 8-mer based on NSRE. (C) Analysis of the usage divergence of 3-mers between the 1XY subset and the overall 8-mer based on NSRE. (D) Analysis of the usage divergence of 4-mers between the 1XY subset and the overall 8-mer based on NSRE. (E) Analysis of the usage divergence of 3-mers between the 2XY subset and the overall 8-mer based on NSRE. (F) Analysis of the usage divergence of 4-mers between the 2XY subset and the overall 8-mer based on NSRE.
Figure 8
Analysis of the usage divergence of 3-mers and 4-mers between the 0XY, 1XY, and 2XY subsets and the overall 8-mer based on S1. (A) Analysis of the usage divergence of 3-mers between the 0XY subset and the overall 8-mer based on S1. (B) Analysis of the usage divergence of 4-mers between the 0XY subset and the overall 8-mer based on S1. (C) Analysis of the usage divergence of 3-mers between the 1XY subset and the overall 8-mer based on S1. (D) Analysis of the usage divergence of 4-mers between the 1XY subset and the overall 8-mer based on S1. (E) Analysis of the usage divergence of 3-mers between the 2XY subset and the overall 8-mer based on S1. (F) Analysis of the usage divergence of 4-mers between the 2XY subset and the overall 8-mer based on S1.
Figure 9
Analysis of the usage divergence of 3-mers and 4-mers between the 0XY, 1XY, and 2XY subsets and the overall 8-mer based on S2. (A) Analysis of the usage divergence of 3-mers between the 0XY subset and the overall 8-mer based on S2. (B) Analysis of the usage divergence of 4-mers between the 0XY subset and the overall 8-mer based on S2. (C) Analysis of the usage divergence of 3-mers between the 1XY subset and the overall 8-mer based on S2. (D) Analysis of the usage divergence of 4-mers between the 1XY subset and the overall 8-mer based on S2. (E) Analysis of the usage divergence of 3-mers between the 2XY subset and the overall 8-mer based on S2. (F) Analysis of the usage divergence of 4-mers between the 2XY subset and the overall 8-mer based on S2.
Figure 10
The 8-mer distributions in Bombyx mori, Caenorhabditis elegans, Ciona intestinalis, and zebrafish. (A) Phylogenetic tree of the species studied in this paper. (B) Overall 8-mer distribution in Bombyx mori. (C) Distribution of 8-mers containing 0, 1, or 2 CG dinucleotides in Bombyx mori. (D) Overall 8-mer distribution in Caenorhabditis elegans. (E) Distribution of 8-mers containing 0, 1, or 2 CG dinucleotides in Caenorhabditis elegans. (F) Overall 8-mer distribution in Ciona intestinalis. (G) Distribution of 8-mers containing 0, 1, or 2 CG dinucleotides in Ciona intestinalis. (H) Overall 8-mer distribution in zebrafish. (I) Distribution of 8-mers containing 0, 1, or 2 CG dinucleotides in zebrafish.
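
The distribution panels in the captions above plot the number of k-mer appearances against the frequency of appearance (FA). A minimal Python sketch of that kind of spectrum follows. It assumes that FA means the fraction of distinct observed 8-mers sharing a given appearance count, which these captions do not state explicitly. The random control matches only sequence length and overall G+C fraction, which is one possible reading of "matched length and CG content". All function names and the toy input are hypothetical.

```python
from collections import Counter
import random

BASES = set("ACGT")

def kmer_counts(seq, k=8):
    """Count every k-mer in seq, skipping windows with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= BASES:
            counts[kmer] += 1
    return counts

def appearance_spectrum(counts):
    """Map 'number of appearances' to the fraction of distinct k-mers
    observed that many times (assumed meaning of FA in this sketch)."""
    histogram = Counter(counts.values())
    total = sum(histogram.values())
    return {n: c / total for n, c in sorted(histogram.items())}

def random_control(length, gc_fraction, seed=0):
    """Random sequence with the given length and expected G+C fraction.

    Assumption: only base composition is matched, not CG dinucleotide
    frequency or any higher-order structure.
    """
    rng = random.Random(seed)
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))

# Toy comparison: a stand-in "observed" sequence versus its random control.
observed = random_control(50000, 0.41, seed=1)
gc_obs = sum(observed.count(b) for b in "GC") / len(observed)
control = random_control(len(observed), gc_obs, seed=2)
for name, seq in (("observed", observed), ("control", control)):
    spectrum = appearance_spectrum(kmer_counts(seq))
    peak = max(spectrum, key=spectrum.get)
    print(f"{name}: most common appearance count = {peak}")
```

Splitting the counts by CG number before calling `appearance_spectrum` would give one spectrum per group, in the spirit of the per-group panels.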
