Optimization of regulatory DNA with active learning

Yuxin Shen et al. Comput Struct Biotechnol J. 2025 Oct 4;27:4384-4392. doi: 10.1016/j.csbj.2025.09.033. eCollection 2025.
Abstract

Many biotechnology applications rely on microbial strains engineered to express heterologous proteins at maximal yield. A common strategy for improving protein output is to design expression systems with optimized regulatory DNA elements. Recent advances in high-throughput experimentation have enabled the use of machine learning predictors in tandem with sequence optimizers to find regulatory sequences with improved phenotypes. Yet the narrow coverage of training data, limited model generalization, and non-convexity of genotype-phenotype landscapes can limit the use of traditional sequence optimization algorithms. Here, we explore the use of active learning as a strategy to improve expression levels through iterative rounds of measurements, model training, and sequence sampling-and-selection. We explore convergence and performance of the active learning loop using synthetic data and an experimentally characterized genotype-phenotype landscape of yeast promoter sequences. Our results show that active learning can outperform one-shot optimization approaches in complex landscapes with a high degree of epistasis. We demonstrate the ability of active learning to effectively optimize sequences using datasets from different experimental conditions, with potential for leveraging data across laboratories, strains or growth conditions. Our findings highlight active learning as an effective framework for DNA sequence design, offering a powerful strategy for phenotype optimization in biotechnology.
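
The active learning loop described above follows a generic measure-train-sample-select cycle. The following is a minimal Python sketch of that cycle, assuming a scikit-learn-style regressor and hypothetical measure_expression and sample_candidates helpers standing in for the experimental assay and the sequence sampler; it illustrates the loop structure, not the authors' implementation.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def one_hot(seqs):
        # Flat one-hot encoding of equal-length DNA sequences.
        lut = {"A": 0, "C": 1, "G": 2, "T": 3}
        X = np.zeros((len(seqs), len(seqs[0]) * 4))
        for i, s in enumerate(seqs):
            for j, base in enumerate(s):
                X[i, j * 4 + lut[base]] = 1.0
        return X

    def active_learning(seqs, measure_expression, sample_candidates,
                        n_loops=4, batch_size=100):
        # measure_expression and sample_candidates are hypothetical
        # stand-ins for the assay and the sampler (e.g. directed evolution).
        y = np.asarray(measure_expression(seqs))
        for _ in range(n_loops):
            model = MLPRegressor(hidden_layer_sizes=(128,), max_iter=500)
            model.fit(one_hot(seqs), y)                  # retrain on all data so far
            candidates = sample_candidates(seqs, y)      # propose new sequences
            scores = model.predict(one_hot(candidates))  # rank by predicted expression
            top = [candidates[i] for i in np.argsort(scores)[-batch_size:]]
            seqs = list(seqs) + top                      # augment training data
            y = np.concatenate([y, measure_expression(top)])
        return seqs, y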


Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1
Active learning and modelling of a synthetic genotype-phenotype landscape. (A) Schematic of traditional one-shot sequence optimization and active learning. In active learning, experimental screening of optimized sequences is built into the optimization loop and employed to iteratively augment the training data. (B) NK model for fitness landscapes; N represents the DNA sequence length and K defines the number of positions that have epistatic interactions with a given position in the sequence. The value of K controls the ruggedness of the expression landscape. (C) Two-dimensional t-SNE projections of the full 10 nt DNA sequence space represented via one-hot encodings (1,048,576 sequences), labeled by their NK-predicted expression; t-SNE parameters: no. neighbors = 15, min. distance = 0.1. (D) Regression performance of a sequence-to-expression model (multilayer perceptron, MLP) trained on the NK synthetic data, before and after optimization of the hidden layer size with grid search. Plots show the coefficient of determination (R²) between model predictions and simulated ground truth on a held-out test set (2,000 sequences); the MLP predictor was trained on 2,000 sequences obtained via Latin hypercube sampling of the full sequence space. Error bars denote the standard deviation of R² across five random test sets of 2,000 sequences each.
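
The NK landscape in panel B is straightforward to simulate. Below is a minimal sketch of one common NK construction, with a random contribution table for each position and its K randomly chosen epistatic partners; the exact construction used in the paper may differ.

    import numpy as np

    def make_nk_landscape(N=10, K=2, alphabet=4, seed=0):
        # Each position i contributes a random value indexed by its own
        # state and the states of K randomly chosen interacting positions;
        # larger K means more epistasis and a more rugged landscape.
        rng = np.random.default_rng(seed)
        partners = [rng.choice([j for j in range(N) if j != i], size=K, replace=False)
                    for i in range(N)]
        tables = [rng.random(alphabet ** (K + 1)) for _ in range(N)]

        def expression(seq):  # seq: length-N array of ints in [0, alphabet)
            total = 0.0
            for i in range(N):
                idx = seq[i]
                for p in partners[i]:
                    idx = idx * alphabet + seq[p]  # base-4 index into table i
                total += tables[i][idx]
            return total / N
        return expression

    f = make_nk_landscape(N=10, K=2)
    print(f(np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1])))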
Fig. 2
Optimization of NK genotype-phenotype landscapes with active learning. (A) Optimized expression levels using active learning with two sampling strategies (random, directed evolution). Bars show the average expression of the final batch of 100 optimized sequences after four active learning loops, across NK landscapes of increasing ruggedness. (B) Two-dimensional t-SNE representation of the final batch of optimized sequences (black) against the ground truth expression levels (as in Fig. 1C) for the NK0 landscape using directed evolution for sampling. (C) Expression levels for each batch of 100 optimized sequences across the active learning loops for the NK0 landscape using directed evolution for sampling, with different values of the exploration parameter α in the Upper Confidence Bound reward function in Eq. (1). (D) Comparison between active learning and one-shot optimization in NK landscapes of increasing ruggedness. Plots show optimized expression per batch in each active learning loop using directed evolution for sampling, against several one-shot optimizers (strong-selection weak-mutation, SSWM; random sampling, RS; gradient descent, GD) run on a multilayer perceptron (MLP) regressor. For a fair comparison, the MLP was retrained on the same number of sequences as those employed for the active learning loop (N, one-shot). In all plots, dots and whiskers represent the mean and standard error across three replicates with resampled initial training sets of 1,000 sequences, one drawn from Latin hypercube sampling and two from uniform sampling. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
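
Eq. (1) itself is not reproduced on this page; the Upper Confidence Bound reward it refers to conventionally takes the form r(x) = μ(x) + α·σ(x), where μ is the predicted expression and σ an uncertainty estimate. A minimal sketch, assuming the uncertainty comes from an ensemble of regressors (the paper's estimator may differ):

    import numpy as np

    def ucb_reward(models, X, alpha=1.0):
        # models: ensemble of regressors trained on resampled data;
        # the across-ensemble spread serves as an uncertainty proxy.
        preds = np.stack([m.predict(X) for m in models])  # (n_models, n_seqs)
        mu, sigma = preds.mean(axis=0), preds.std(axis=0)
        return mu + alpha * sigma  # larger alpha favors exploration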
Fig. 3
Active learning of promoter sequences in Saccharomyces cerevisiae. (A) We employed a large promoter dataset from Vaishnav et al., including more than 20 million fully randomized promoter sequences measured in two growth media. Transformer-based models can regress the expression landscape with high accuracy (Pearson r > 0.95 in both growth media [13]). (B) Promoter optimization with different batches of initial data; the inset shows two subsamples of n = 1,000 promoters with uniform (blue) and low (orange) expression; as in Fig. 2D, plots show optimized expression levels across the active learning loops, compared to three one-shot optimizers on MLP regressors trained on the same number of sequences. Low-expression samples were obtained by first sampling 100,000 sequences, followed by selection of the sequences with lowest expression (see Methods). Dots represent the mean and whiskers show the standard error across three replicates with resampled initial training sets as in Fig. 2D. (C) Robustness of active learning to measurement noise. Left: we added Gaussian noise to the expression levels employed to initialize the active learning loop (top) and gradually decreased the correlation with the ground truth data (bottom). Right: optimized expression levels after four active learning loops with directed evolution for sampling. Expression levels were perturbed with additive Gaussian noise with mean μ equal to the ground truth and increasing variance σ². Noise was quantified by the coefficient of variation (CV = σ/μ) as a measure of dispersion. (D) Active learning of promoter sequences across growth conditions. We initialized the active learning loop with n = 1,000 sequences measured in one medium, and ran the active learning optimization using iterative collection of data in a different medium. The use of sequence pre-optimization in the original medium can lead to performance improvements. All active learning results employ directed evolution for sampling promoter sequences. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
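
The noise model in panel C follows directly from the caption: each ground-truth expression value y is replaced by a draw from a Gaussian with mean y and standard deviation CV·y, so that σ/μ equals the stated coefficient of variation. A minimal sketch:

    import numpy as np

    def add_measurement_noise(y, cv, seed=0):
        # Draw each noisy value from N(y_i, (cv * y_i)^2), so the
        # coefficient of variation sigma/mu equals cv, as in Fig. 3C.
        rng = np.random.default_rng(seed)
        y = np.asarray(y, dtype=float)
        return rng.normal(loc=y, scale=cv * np.abs(y))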
Fig. 4
Improving active learning performance with alternative methods for sequence sampling and selection. (A) Two additional sequence sampling methods: genetic drift and recombination. (B) Comparison of active learning performance for different sequence sampling methods, initialized with a Latin hypercube sampling batch. (C) Enrichment of promoter motifs across the active learning loops for different sequence sampling strategies. Shown are average position frequency matrix (PFM) scores of 244 motifs for each batch across the active learning loops; values were normalized to the PFM scores of the initial batch. (D) Comparison of active learning with directed evolution for sampling and selection based on a reward function with and without motif weighting, as in Eq. (3); error bars denote variation across three initial batches of promoters.
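
Of the samplers in panel A, recombination is simple to illustrate: pairs of parent sequences exchange subsequences at a random crossover point. The sketch below is a generic single-point crossover, assumed for illustration rather than taken from the paper's code:

    import random

    def recombine(parents, n_offspring=100, seed=0):
        # Single-point crossover: each offspring takes the head of one
        # parent and the tail of another, cut at a random position.
        rng = random.Random(seed)
        offspring = []
        for _ in range(n_offspring):
            a, b = rng.sample(parents, 2)   # two distinct parent sequences
            cut = rng.randrange(1, len(a))  # crossover point
            offspring.append(a[:cut] + b[cut:])
        return offspring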

References

    1. Clomburg J.M., Crumbley A.M., Gonzalez R. Industrial biomanufacturing: the future of chemical production. Science. 2017;355(6320).
    2. Cazier A.P., Blazeck J. Advances in promoter engineering: novel applications and predefined transcriptional control. Biotechnol J. 2021;16(10).
    3. Konkle B.A., Walsh C.E., Escobar M.A., et al. Bax 335 hemophilia B gene therapy clinical trial results: potential impact of CpG sequences on gene expression. Blood J Am Soc Hematol. 2021;137(6):763–774.
    4. Castillo-Hair S., Fedak S., Wang B., et al. Optimizing 5'UTRs for mRNA-delivered gene editing using deep learning. Nat Commun. 2024;15(1):5284.
    5. Greenbury S.F., Louis A.A., Ahnert S.E. The structure of genotype-phenotype maps makes fitness landscapes navigable. Nat Ecol Evol. 2022;6(11):1742–1752.
