Nucleic Acids Res. 2025 Oct 14;53(19):gkaf1014.
doi: 10.1093/nar/gkaf1014.

Mapping transcription factor binding sites by learning UV damage fingerprints

Hannah E Wilson et al. Nucleic Acids Res. 2025.

Abstract

Deciphering transcriptional networks requires methods to accurately map binding sites of sequence-specific transcription factors (ssTFs) across the genome. Here, we show that ssTF binding induces distinct patterns of UV-induced cyclobutane pyrimidine dimers (CPDs), and that these CPD 'fingerprints' can be exploited by machine learning methods to identify ssTF binding sites (TFBS). As a proof of principle, we analyzed CPD-seq data from yeast cells using the Random Forest algorithm to identify 75 TFBS bound by the Hap2/Hap3/Hap5 ssTF complex, including ∼25 new sites missed by previous chromatin immunoprecipitation (ChIP)-based experiments. Parallel analysis of the Gcr1 ssTF using a neural network trained on CPD-seq data including only 6 known binding sites identified 63 Gcr1 TFBS across the genome. Our analysis indicates that the newly identified TFBS are associated with many genes that function in expected categories (e.g. mitochondrial respiration or glycolysis), and whose mRNA levels are down-regulated in ssTF mutants. Similar analysis of CPD-capture-sequencing data from human cells identified new sites bound by the homologous Nuclear Factor-Y complex. These findings indicate that distinct cellular patterns of UV damage occurring at different classes of TFBS can be recognized by machine learning methods to map these regulatory elements with improved accuracy and single-nucleotide resolution.

Conflict of interest statement

The authors declare they have no competing interests.

Figures

Graphical Abstract
Figure 1.
Using CPD-seq to analyze cellular UV damage patterns at Hap2/Hap3/Hap5 binding sites across the yeast genome. (A) Schematic detailing the experimental procedure for the cyclobutane pyrimidine dimer-sequencing (CPD-seq) method, which utilizes the repair enzymes T4 endonuclease V (T4 endoV) and apurinic/apyrimidinic endonuclease 1 (APE1) to map UV-induced CPDs across the genome at single-nucleotide resolution. (B) Plot showing average CPD induction in UV-irradiated wild-type (WT) yeast cells relative to UV-irradiated naked DNA controls at DNA regions adjacent to 35 known Hap2/Hap3/Hap5 binding sites identified by ChIP-exo. Average CPD induction is defined as the difference in CPD counts between UV-irradiated cells and the normalized naked DNA control, divided by the total number of DNA positions (i.e. binding sites) analyzed. (C) Close-up showing the average CPD induction within the Hap2/Hap3/Hap5 binding motif for 35 known binding sites in WT yeast cells. CPD induction is calculated for half-integer positions (e.g. average CPD induction at position −1.5 corresponds to CPDs forming between CC bases at positions −1 and −2 in the binding site), as before. DNA sequence logo was generated using the weblogo software. (D) Snapshot of CPD induction in the promoter region of the INHibitor (of F1F0-ATPase) INH1 gene (indicated by green rectangle on the bottom right of the display), which regulates the activity of the mitochondrial F1F0-ATPase. CPD induction is depicted for two independent CPD-seq experiments in which UV-irradiated WT cells were each compared to a naked DNA control. The third track shows cumulative CPD induction in two control CPD-seq data sets for hap5Δ cells relative to the naked DNA controls. The bottom track shows aggregate positions of ChIP-exo peaks derived from Hap2, Hap3, and Hap5 ChIP-exo experiments. Image generated using the Integrative Genomics Viewer (IGV) software. (E) Close-up of CPD induction and ChIP-exo peaks at the Hap2/Hap3/Hap5 binding site in the INH1 promoter.
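
The definition of average CPD induction given in panel B can be illustrated with a minimal sketch. The caption does not say how the naked DNA control is normalized, so the `scale` factor below is an assumption, as are the function name and the toy counts; this is not the authors' pipeline.

```python
def average_cpd_induction(cell_counts, naked_counts, scale=1.0):
    """Average CPD induction per position across a set of binding sites.

    cell_counts, naked_counts: one list per binding site, each holding CPD
    counts at the same aligned (half-integer) positions around the motif.
    scale: hypothetical factor that normalizes the naked DNA control to the
    cellular data (the normalization method is an assumption here).
    """
    n_sites = len(cell_counts)
    n_positions = len(cell_counts[0])
    totals = [0.0] * n_positions
    for cell, naked in zip(cell_counts, naked_counts):
        for i in range(n_positions):
            # Difference between cellular counts and the normalized control.
            totals[i] += cell[i] - scale * naked[i]
    # Divide by the number of binding sites analyzed.
    return [total / n_sites for total in totals]


# Toy example: two sites, two positions each (made-up counts).
example = average_cpd_induction([[4, 2], [6, 0]], [[2, 2], [2, 4]])
```

Positive values correspond to more CPDs in cells than in naked DNA at that position, and negative values to fewer.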
Figure 2.
Using the Random Forest machine learning method to identify Hap2/3/5 binding sites based on their CPD fingerprint across the yeast genome. (A) Graph depicting training attributes for the 35 known Hap2/3/5 binding sites. Bottom panel is average CPD induction (same as Fig. 1C). Middle panel is log2 ratio of CPD enrichment in UV-irradiated WT cells relative to the naked DNA control, after flooring data to a minimum CPD count of 10. Top panel is the sequence logo of the analyzed binding sites (same as Fig. 1C). Only data for positions −1.5 to +4.5 for CPD induction and CPD log ratio were used as training attributes. (B) Schematic showing machine learning procedure using Random Forest. 35 known Hap2/3/5 binding sites were used as positive examples and 7000 CCAAT motifs inside yeast ORFs were used as negative examples. The attribute data from each of these examples were used to train Random Forest implemented in the Weka software [42] to identify Hap2/3/5 binding sites. (C) Receiver operating characteristic (ROC) curve for Random Forest model trained on 35 positive examples and 7000 negative examples using five-fold cross validation. The area under the curve (AUC) value of the ROC curve is indicated. Inset shows the confusion matrix from the five-fold cross validation analysis, with “actual bound” indicating the 35 positive examples and “actual not bound” (NB) indicating the 7000 negative examples, while “predicted” (Pred.) indicates the number of sites classified as bound or not bound by the Random Forest algorithm. (D) Plot showing CPD induction values in a 20 bp window surrounding the center of the 75 Hap2/3/5 binding sites identified by the Random Forest algorithm. Each row depicts the CPD induction values of a single binding site. Rows are organized with the 35 binding sites in the training data on top (green rectangle), 14 identified binding sites associated with a nearby ChIP-exo peak, but not present in the training data (i.e. “ChIP-exo discovered,” purple rectangle) in the middle, and 26 binding sites not associated with a ChIP-exo peak (i.e. “new binding sites, no ChIP-exo,” brown rectangle) at the bottom. Columns indicate the positions of damage induction relative to the center of the bound motif. Colors indicate the sign and magnitude of the damage induction, with red indicating higher CPD induction in cells relative to naked DNA, and blue indicating lower CPD induction in cells relative to the naked DNA control (see color bar). Top of the panel gives the sequence logo for the 75 identified binding sites, generated using the weblogo software [63]. Left panel indicates CPD induction data for WT cells, middle panel depicts CPD induction data for hap5Δ cells, and right panel indicates the positions of ChIP-exo peaks within 50 bp of the 75 identified binding sites, derived from the aggregate of published ChIP-exo data for Hap2, Hap3, and Hap5 [36].
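
The training attributes described in panel A can be sketched as follows. This is an illustration only: it assumes the floor of 10 applies to both the cellular and the naked DNA counts, and the function names and dictionary layout are hypothetical. The classifier itself (Random Forest in Weka) is not reproduced here.

```python
from math import log2

# Half-integer positions −1.5 to +4.5 relative to the motif center
# (seven positions, as stated in panel A).
TRAINING_POSITIONS = [p + 0.5 for p in range(-2, 5)]


def floored_log2_ratio(cell_count, naked_count, floor=10):
    """log2 ratio of cellular to naked DNA CPD counts after flooring.

    Assumption: counts below `floor` are raised to `floor` in both the
    numerator and the denominator, which damps ratios from sparse data.
    """
    return log2(max(cell_count, floor) / max(naked_count, floor))


def training_attributes(induction_by_pos, log_ratio_by_pos):
    """Concatenate CPD induction and log ratio values into one vector.

    Both arguments are dicts keyed by half-integer position (hypothetical
    layout); the result has 14 values per candidate site.
    """
    induction = [induction_by_pos[p] for p in TRAINING_POSITIONS]
    log_ratio = [log_ratio_by_pos[p] for p in TRAINING_POSITIONS]
    return induction + log_ratio
```

With the floor in place, a position with 3 cellular and 7 naked DNA CPDs gives a log ratio of 0 rather than a negative value driven by very low counts.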
Figure 3.
(A) Visualization of CPD induction and ChIP-exo data near a predicted Hap2/3/5 binding site in the promoter of the COX4 gene. Top panel shows CPD induction derived from normalized CPD-seq data for two wild-type (WT) replicates and hap5Δ mutant cells, relative to naked DNA controls. The positions of ChIP-exo peaks and ChIP-exo reads for the Hap2, Hap3, and Hap5 subunits in aggregate are also depicted. Position of the COX4 gene is indicated with a green rectangle on the bottom left of the panel. Bottom panel shows a zoomed-in view near the predicted Hap2/3/5 binding site. ChIP-exo data from [36]. (B) Same as panel A, except for a predicted Hap2/3/5 binding site in the promoter of the COX5A gene. Figure generated using IGV [35].
Figure 4.
Characterization of Hap2/3/5 binding sites and associated target genes identified by CPD-seq. (A) Venn diagram showing overlap between Hap2/3/5 binding sites identified by Random Forest analysis of CPD-seq data (i.e. CPD fingerprint) and peaks/binding sites identified by ChIP-exo [36]. Binding sites located within 50 bp of a ChIP-exo peak were considered to be overlapping. (B-D) Lists of Hap2/3/5 target genes identified by (B) CPD fingerprint only, (C) ChIP-exo and CPD fingerprint, and (D) ChIP-exo only. The names of genes that have an aerobic respiration deficiency when mutated are colored brown with an asterisk, while genes that function in the ‘electron transport and membrane-associated energy conservation’ functional category (but do not have an aerobic respiration phenotype) are colored brown but do not have an asterisk. Blue rectangles indicate genes whose mRNA levels are significantly down-regulated in hap2Δ, hap3Δ, or hap5Δ mutants (P < 1 × 10⁻⁶, log2 ratio ≤ −0.5), red rectangles indicate genes whose mRNA levels are significantly up-regulated in hap2Δ, hap3Δ, or hap5Δ mutants (P < 1 × 10⁻⁶, log2 ratio ≥ 0.5), white rectangles indicate no significant change in mRNA levels, and gray rectangles indicate that no data are available for this gene. Gene expression data for hap2Δ, hap3Δ, or hap5Δ mutants are from [50]. (E) Schematic showing target genes encoding ssTFs that contain Hap2/3/5 binding sites identified by CPD fingerprinting. Rectangles indicate target genes, while circles indicate the Hap2/3/5 complex. Black arrows indicate regulatory interactions due to a bound Hap2/3/5 site in the promoter region of the gene. Black arrows glowing yellow indicate novel Hap2/3/5 targets identified by CPD fingerprinting but not ChIP-exo. (F) Pie chart indicating the number of down-regulated genes in hap2Δ mutant cells (P < 1 × 10⁻⁶, log2 ratio ≤ −0.5) that contain binding sites identified by both CPD fingerprint and ChIP-exo (‘Both’), CPD fingerprint only, or by neither method (‘Not Bound’). No down-regulated genes were identified by ChIP-exo only. **Indicates a significant overlap of Hap2/3/5 target genes with the genes down-regulated in hap2Δ mutant cells; P < 0.0001, based on the hypergeometric distribution. (G) Fraction of Hap2/3/5 binding sites identified by CPD fingerprint (top panel), CPD fingerprint and ChIP-exo (‘Both’, middle panel), and ChIP-exo only (bottom panel) located at the indicated distance from the nearest TSS of a neighboring gene. TSS data are from [64].
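
Two calculations mentioned in this caption, the 50 bp overlap criterion (panel A) and the hypergeometric overlap test (panel F), can be sketched as below. The function names are hypothetical, and the sketch assumes all coordinates lie on one chromosome and that "within 50 bp" is inclusive. The gene counts behind the reported P value are not given in the caption, so none are used here.

```python
from math import comb


def overlaps_peak(site_position, peak_positions, window=50):
    """True if any peak lies within `window` bp of the binding site.

    Assumption: positions are integer coordinates on the same chromosome
    and the distance cutoff is inclusive.
    """
    return any(abs(site_position - peak) <= window for peak in peak_positions)


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for a hypergeometric random variable X.

    population: total number of genes considered
    successes:  genes in the first set (e.g. down-regulated genes)
    draws:      genes in the second set (e.g. genes with a binding site)
    observed:   number of genes shared by both sets
    """
    lower = max(observed, 0)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(lower, upper + 1)
    )
    return favorable / comb(population, draws)
```

For instance, drawing 5 items from a population of 10 that contains 5 marked items yields all 5 marked items with probability 1/252, which is what `hypergeom_upper_tail(10, 5, 5, 5)` computes.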
Figure 5.
UMAP analysis of CPD induction patterns associated with binding sites for 78 different yeast ssTFs. Average CPD induction values in UV-irradiated WT cells relative to normalized naked DNA controls were analyzed from positions −5.5 to +5.5 relative to the center of each TFBS. The indicated ssTFs are highlighted and labeled on the UMAP graph, and the average CPD induction patterns of their binding sites are plotted. Sequence logos for each set of binding sites analyzed were generated using weblogo software [63].
Figure 6.
Identifying Gcr1 binding sites by learning their distinct CPD fingerprint. (A) Plot of log2 CPD enrichment (top graph) and average CPD induction (bottom graph) in UV-irradiated cells relative to the scaled naked DNA control for six known Gcr1 binding sites (based on published ChIP-exo data from [36]). Sequence logo in top panel was generated using the weblogo software [63]. (B) CPD induction values adjacent to 63 Gcr1 binding sites identified by the neural network (NN) classifier. Each row depicts the CPD induction values of a single binding site. Rows are organized with the 6 known binding sites in the training data on top (green rectangle) and the 57 discovered binding sites not present in the training data (brown rectangle) at the bottom. Columns indicate the positions of damage induction relative to the center of the bound motif. Colors indicate the sign and magnitude of the damage induction, with red indicating higher CPD induction in cells relative to naked DNA, and blue indicating lower CPD induction in cells relative to the naked DNA control (see color bar). Top of the panel gives the sequence logo for the 63 identified binding sites, generated using the weblogo software [63]. Left panel indicates CPD induction data for WT cells, middle panel indicates the positions of ChIP-exo peaks within 50 bp of the 63 identified binding sites, derived from published ChIP-exo data for Gcr1 [36], and right panel indicates whether the binding site occurs in a Ty element retrotransposon or LTR. (C) Left panel depicts changes in mRNA levels in gcr1, gcr2, and rap1 mutant cells for genes associated with a Gcr1 binding site identified by CPD fingerprinting. All of these genes are associated with discovered Gcr1 binding sites, with the exception of TPI1, which is associated with only a known Gcr1 binding site. Blue color indicates a decrease in mRNA expression in the indicated mutants, while red indicates an increase in mRNA levels (see key at bottom of panel). Gene names are color-coded to indicate the step in the glycolysis pathway (see right panel) in which the enzyme they encode functions. mRNA expression data are from [65] and were analyzed using RegulatorDB [66]. (D) CPD induction data for two UV-irradiated wild-type (WT) cell replicates relative to scaled naked DNA controls, as well as locations of called Gcr1 ChIP-exo peaks (bottom panel, data from [36]), in the promoter of the TDH2 gene. Red bars indicate positive CPD induction in cells relative to naked DNA, and blue bars indicate negative CPD induction. Brown arrows indicate Gcr1 binding sites identified by CPD fingerprinting. Image generated using IGV [35]. (E) Same as panel D, except close-up of one of the identified Gcr1 binding sites, which is indicated by a brown arrow.
Figure 7.
Using CPD-capture-seq and Random Forest to identify NF-Y binding sites in human cells. (A) Graph showing average CPD induction in normalized UVC- and UVB-irradiated NHF1 cells relative to UVC- and UVB-irradiated naked DNA controls associated with 156 NF-Y binding sites identified by ENCODE [38, 39] and associated with a DNase I hypersensitivity (DHS) region in melanocytes [40] that overlapped with the central 360 bp of one (or more) of the capture regions. CPD-capture-seq data are from [41]. (B) Close-up of average CPD induction data shown in panel A. Sequence logo was generated using weblogo software [63]. (C) UMAP analysis of CPD induction patterns associated with binding sites for 78 different yeast transcription factors (see Fig. 5), and human NF-Y binding sites derived from analysis of human CPD-capture-seq data. Average CPD induction data for positions −5.5 to +5.5 relative to the center of each transcription factor binding site were analyzed using UMAP. The positions of data points corresponding to yeast Hap2/Hap3/Hap5 binding sites are indicated in red, and the NF-Y data point is indicated in orange. (D) Plot showing average CPD induction in normalized UVC- and UVB-irradiated NHF1 cells relative to UVC- and UVB-irradiated naked DNA controls for 337 NF-Y binding sites identified by the Random Forest algorithm. Each row corresponds to an individual binding site, and the colors indicate the magnitude of CPD induction (red) or CPD suppression (blue) in the UV-irradiated cells relative to the naked DNA control. Binding sites are sorted based on whether the identified binding site is located within 50 bp of an ENCODE NF-Y binding site identified by ChIP-seq analysis (i.e. ‘ENCODE binding sites (Discovered)’, top of panel) or not (i.e. ‘New binding sites (No ENCODE binding site)’, bottom of panel). The right panel indicates the location of the ENCODE NF-Y binding site relative to the NF-Y binding site identified by CPD-capture-seq. ENCODE data from [38, 39]. (E) Snapshot of CPD induction values in normalized UVB- or UVC-irradiated NHF1 cells relative to UVB- or UVC-irradiated naked DNA controls, associated with three NF-Y binding sites identified by Random Forest analysis of human CPD-capture-seq data. Dark green rectangles represent the location of the capture region, while the light green rectangle with black outline indicates the location of the NDUFS8 gene. Arrows indicate locations of identified NF-Y binding sites, with purple arrows indicating NF-Y binding sites previously identified by ENCODE, and the brown arrow indicating a new binding site identified only by CPD fingerprinting. Bottom panels depict close-ups of CPD induction at each identified NF-Y binding site. Images generated using IGV [35].

References

    1. Cramer P. Organization and regulation of gene transcription. Nature. 2019;573:45–54. doi: 10.1038/s41586-019-1517-4.
    2. Weake VM, Workman JL. Inducible gene expression: diverse regulatory mechanisms. Nat Rev Genet. 2010;11:426–37. doi: 10.1038/nrg2781.
    3. Lambert SA, Jolma A, Campitelli LF et al. The human transcription factors. Cell. 2018;172:650–65. doi: 10.1016/j.cell.2018.01.029.
    4. Kwast KE, Burke PV, Poyton RO. Oxygen sensing and the transcriptional regulation of oxygen-responsive genes in yeast. J Exp Biol. 1998;201:1177–95. doi: 10.1242/jeb.201.8.1177.
    5. Bolotin-Fukuhara M. Thirty years of the HAP2/3/4/5 complex. Biochim Biophys Acta. 2017;1860:543–59. doi: 10.1016/j.bbagrm.2016.10.011.