Abstract
Many biotechnology applications rely on microbial strains engineered to express heterologous proteins at maximal yield. A common strategy for improving protein output is to design expression systems with optimized regulatory DNA elements. Recent advances in high-throughput experimentation have enabled the use of machine learning predictors in tandem with sequence optimizers to find regulatory sequences with improved phenotypes. Yet the narrow coverage of training data, limited model generalization, and non-convexity of genotype–phenotype landscapes can limit the use of traditional sequence optimization algorithms. Here, we explore the use of active learning as a strategy to improve expression levels through iterative rounds of measurements, model training, and sequence sampling-and-selection. We explore convergence and performance of the active learning loop using synthetic data and an experimentally characterized genotype–phenotype landscape of yeast promoter sequences. Our results show that active learning can outperform one-shot optimization approaches in complex landscapes with a high degree of epistasis. We demonstrate the ability of active learning to effectively optimize sequences using datasets from different experimental conditions, with potential for leveraging data across laboratories, strains or growth conditions. Our findings highlight active learning as an effective framework for DNA sequence design, offering a powerful strategy for phenotype optimization in biotechnology.
1. Introduction
The design of DNA sequences to achieve a desired phenotype is a key challenge in biotechnology. Regulatory elements, in particular, are often employed to control protein expression across many use cases, including industrial strain design [1], [2], gene therapy [3], and mRNA therapeutics [4]. Designing regulatory sequences that maximize expression often requires many iterations and domain knowledge to discover mutations that improve the desired phenotype. Moreover, the high-dimensionality of genotype space together with higher order interactions between mutations can produce complex and highly non-convex genotype–phenotype landscapes. Such landscapes are challenging to navigate experimentally and computational models are increasingly being adopted for in silico exploration of the genotype space [5].
Advances in massively parallel reporter assays are generating extensive sequencing and phenotypic data [6], [7], [8], [9], [10], which enable the use of machine learning algorithms to model protein expression landscapes [11]. Such sequence-to-expression (STE) models have been developed for various regulatory elements, including ribosome binding sites [12], promoters [13], [14], transcriptional enhancers [15], [16], and 5’ untranslated regions [17], [18]. A common approach to sequence design is one-shot optimization, whereby STE models are first trained on sequencing and screening data, and then looped into a sequence optimization algorithm (Fig. 1A) [19], [20]. The optimizer navigates the input space towards sequences with improved predicted expression, which can then be screened in vivo. This approach has led to the discovery of improved sequences in model microbes [13], [14], [21], [22] as well as complex multicellular organisms such as Drosophila, zebrafish and mice [16], [23], [24].
Fig. 1.
Active learning and modelling of a synthetic genotype-phenotype landscape. (A) Schematic of traditional one-shot sequence optimization and active learning. In active learning, experimental screening of optimized sequences is built into the optimization loop and employed to iteratively augment the training data. (B) NK model for fitness landscapes; $N$ represents the DNA sequence length and $K$ defines the number of positions that have epistatic interactions with a given position in the sequence. The value of $K$ controls the ruggedness of the expression landscape. (C) Two-dimensional t-SNE projections of the full 10 nt DNA sequence space represented via one-hot encodings (1,048,576 sequences), labeled by their NK-predicted expression; t-SNE parameters no. neighbors=15 and min. distance=0.1. (D) Regression performance of a sequence-to-expression model (multilayer perceptron, MLP) trained on the NK synthetic data, before and after optimization of the hidden layer size with grid search. Plots show the coefficient of determination ($R^2$) between model predictions and simulated ground truth on a held-out test set (2,000 sequences); the MLP predictor was trained on 2,000 sequences obtained via Latin hypercube sampling of the full sequence space. Error bars denote the standard deviation of $R^2$ across five random test sets of 2,000 sequences each.
There are many computational techniques to traverse the STE predictions toward maxima, including gradient ascent [22], global optimization [25] and generative modelling [14], [26], [27]. However, one-shot optimization can face limitations due to the sparse and limited coverage of the sequence space employed for training—particularly as sequence length increases. In one-shot optimization, the STE model remains fixed throughout the sequence search, and thus optimizers can divert far away from the sequence space where the model was originally trained. This can lead to low-confidence predictions, decrease the success rate of experimental testing, and increase discovery costs. While this can be mitigated by training on larger data, the cost and complexity of acquiring large data can be a barrier in many real-world use cases [11].
Active learning is a paradigm whereby models are iteratively re-trained with batches of newly acquired data (Fig. 1A). This enables the adaptive selection of new sequences to measure and maximize the information gained from each round of experiments [28]. Active learning has found applications in many biological design tasks, including protein engineering [29], [30], drug discovery [31], [32], [33], media optimization [34], [35], [36], and metabolic engineering [37], [38], [39]. Recent works have demonstrated the potential of active learning for designing enhancer sequences that control protein expression [40], [41]. Here, we explore active learning as a general tool for optimization of regulatory DNA sequences. We first examine the performance of active learning on a synthetic phenotype landscape produced with the classic NK fitness model [42] adapted to nucleotide mutations. The results suggest that active learning can effectively traverse the sequence space toward increased expression even in increasingly rugged landscapes with many local optima. We then apply the methodology to an experimentally measured expression landscape containing more than 20,000,000 promoter sequences in Saccharomyces cerevisiae, using a highly accurate Transformer-based deep learning model as a surrogate to extrapolate expression across the whole sequence space. We demonstrate the ability of active learning to robustly find sequences with improved expression, and its ability to utilize data acquired in different growth conditions. We finally explore several performance improvements by embedding biological knowledge into the strategy for sequence sampling and selection. Our results demonstrate the utility of active learning in DNA sequence design in sparsely sampled and highly non-convex protein expression landscapes.
2. Results
2.1. Active learning on a synthetic genotype–phenotype landscape
We focus on optimization of regulatory DNA sequences using sequence-to-expression (STE) machine learning models as predictors. In a typical use case, an STE model is trained on measured genotype–phenotype pairs, each consisting of a DNA sequence of fixed length and a readout of expression, typically quantified via fluorescence reporters or suitably designed screening assays that couple expression to fitness. In one-shot optimization, the STE model remains fixed and is iteratively queried by a sequence searching algorithm, typically using hill climbing or global optimization heuristics. We focus on an active learning approach (Fig. 1A), whereby the STE model is retrained at each learning loop on the existing data augmented with a new batch of sequences selected in the previous loop. By careful design of a sampling-and-selection routine that balances exploration and exploitation of the sequence space, active learning can continuously improve model accuracy and navigate the predicted landscape in high-confidence regions of the STE model.
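The loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the setup used in this work: the hidden additive landscape (`true_w`), the ridge-regression ensemble fitted on bootstrap resamples (swapped in for neural networks to keep the example dependency-free), the mean-plus-weighted-standard-deviation reward, and all sizes (`L`, `BATCH`, `LOOPS`, `BETA`).

```python
import numpy as np

rng = np.random.default_rng(0)
L, N_MODELS, BATCH, LOOPS, BETA = 8, 5, 20, 4, 1.0   # hypothetical toy settings
true_w = rng.normal(size=(L, 4))                      # hidden toy landscape (additive)

def measure(X):
    """Stand-in for an experimental measurement of one-hot encoded sequences."""
    return (X.reshape(len(X), L, 4) * true_w).sum(axis=(1, 2))

def random_onehot(n):
    idx = rng.integers(4, size=(n, L))
    return np.eye(4)[idx].reshape(n, L * 4)

def mutate(X):
    """Redraw one random position per sequence (the base may stay unchanged)."""
    idx = X.reshape(len(X), L, 4).argmax(axis=2)
    pos = rng.integers(L, size=len(X))
    idx[np.arange(len(X)), pos] = rng.integers(4, size=len(X))
    return np.eye(4)[idx].reshape(len(X), L * 4)

def fit_ensemble(X, y):
    """Ridge regressors on bootstrap resamples (assumed surrogate ensemble)."""
    models = []
    for _ in range(N_MODELS):
        rows = rng.integers(len(X), size=len(X))
        A = X[rows]
        w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(A.shape[1]), A.T @ y[rows])
        models.append(w)
    return np.array(models)

X = random_onehot(50)          # initial training sequences
y = measure(X)
batch = X
history = [y.mean()]
for _ in range(LOOPS):
    models = fit_ensemble(X, y)                           # retrain on all data so far
    parents = batch[rng.integers(len(batch), size=500)]   # sample from the last batch
    candidates = mutate(parents)
    preds = candidates @ models.T                         # shape (500, N_MODELS)
    reward = preds.mean(axis=1) + BETA * preds.std(axis=1)
    batch = candidates[np.argsort(reward)[-BATCH:]]       # keep the top-ranked batch
    y_batch = measure(batch)                              # "screen" the selected batch
    X, y = np.vstack([X, batch]), np.concatenate([y, y_batch])
    history.append(y_batch.mean())
```

Here `history` tracks the mean measured expression of each selected batch, which is the quantity one would plot per loop to inspect batch-to-batch improvement.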
To first explore the ability of active learning to traverse highly non-convex expression landscapes, we focussed on synthetic data generated with a theoretical fitness model [43]. This allows sampling the landscape across the entire sequence space and computing global maxima as a ground truth baseline. We focussed on the NK model for fitness landscapes, because it has few parameters and produces landscapes with tunable ruggedness. The NK model was originally developed for gene-to-gene interactions [42], and we adapted it to model epistatic interactions between positions in a nucleotide sequence (Fig. 1B). In the adapted model, $N$ is the number of positions in a sequence, and $K$ models the number of other positions each nucleotide interacts epistatically with (i.e., the order of interaction). The parameter $K$ can be tuned between $K = 0$ (no epistasis, only additive effects) and $K = N - 1$ to control the ruggedness, with larger $K$ leading to increasingly complex landscapes with many local maxima and minima; more details on our implementation of the NK landscape can be found in the Methods.
We generated several NK landscapes for sequences of length $N = 10$, resulting in a sequence space with a total of $4^{10} = 1{,}048{,}576$ variants. To visualize the ruggedness of the landscape, we employed the t-SNE dimensionality reduction algorithm (Fig. 1C). As $K$ increases, variants with more extreme phenotypes tend to appear more dispersed across sequence space, reflecting the increased ruggedness and non-convexity of the landscape. To first assess the challenge of modelling such complex landscapes, we trained feedforward neural networks on NK fitness values (Fig. 1D). Models were trained on 2,000 sequences selected by Latin Hypercube Sampling (LHS) (Supplementary Figure S1), which represents a coverage of approximately 0.2 % of the sequence space; this is substantially larger than some of the largest datasets employed in the literature. Prediction results on a held-out test set show that in the absence of epistatic interactions ($K = 0$, Fig. 1D), the landscape can be regressed with reasonable accuracy. However, for larger values of $K$, the introduction of higher-order interactions makes the landscape significantly more challenging to regress, even after landscape-specific optimization of the neural architecture.
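As a quick check of the numbers above, the following sketch computes the size of the 10 nt sequence space and the fraction covered by 2,000 training sequences, and shows one way to one-hot encode a sequence. The helper name `one_hot` and the A/C/G/T channel ordering are illustrative assumptions, not details taken from this work.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order for the one-hot encoding

def one_hot(seq: str) -> np.ndarray:
    """Return a flattened one-hot vector of length 4 * len(seq)."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    encoded[np.arange(len(seq)), idx] = 1.0
    return encoded.ravel()

N = 10
space_size = len(ALPHABET) ** N      # 4**10 = 1,048,576 variants
coverage = 2_000 / space_size        # fraction of the space used for training
print(space_size, f"{coverage:.2%}")  # prints: 1048576 0.19%
```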
We built an active learning loop designed to find the global optimum of the NK fitness landscape (Fig. 1A), assuming an initial set of 1,000 variants for training that were randomly selected from the whole input space. We employed an ensemble of feedforward neural networks as a machine learning regressor of the fitness landscape. At each active learning loop, sequences are sampled and selected based on the model predictions and a reward function. First, the ensemble is queried with a set of sampled sequences, and the Upper Confidence Bound (UCB) reward function is calculated for each sequence $x_i$:
$$\mathrm{UCB}(x_i) = \mu_i + \beta\,\sigma_i, \tag{1}$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the predictions $\hat{y}_{ij}$ across the ensemble, and $\hat{y}_{ij}$ is the predicted value of the $i$th sequence with the $j$th model in the ensemble. The two terms in Eq. (1) thus correspond to the mean and standard deviation of the predicted expression levels across an ensemble of neural networks for each sequence. The parameter $\beta$ controls the balance between sequence exploration and exploitation. For sequence selection, to emulate scenarios with limited data acquisition capability, we fixed the batch size to the top 100 sequences ranked by their reward function value. At each learning loop, the selected batch then goes into the evaluation step, and is added to the existing data for model retraining in the next loop.
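A minimal sketch of this selection step follows, assuming the common Upper Confidence Bound form in which the reward is the ensemble mean plus a weighted ensemble standard deviation. The function names, the weight name `beta`, and the use of the population standard deviation are assumptions made for illustration.

```python
import numpy as np

def ucb_reward(preds: np.ndarray, beta: float) -> np.ndarray:
    """preds has shape (n_models, n_sequences); returns one reward per sequence."""
    mean = preds.mean(axis=0)   # exploitation term: average predicted expression
    std = preds.std(axis=0)     # exploration term: disagreement across the ensemble
    return mean + beta * std

def select_top(preds: np.ndarray, beta: float, batch_size: int) -> np.ndarray:
    """Indices of the batch_size sequences with the highest reward, best first."""
    reward = ucb_reward(preds, beta)
    return np.argsort(reward)[::-1][:batch_size]

# Toy predictions from two models for three sequences.
preds = np.array([[1.0, 2.0, 3.0],
                  [1.0, 4.0, 3.0]])
print(ucb_reward(preds, beta=1.0))                # [1. 4. 3.]
print(select_top(preds, beta=1.0, batch_size=2))  # [1 2]
```

In the toy example, the second sequence ranks first because the two models disagree on it, which is the behaviour the exploration term is meant to reward.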
To test the impact of the strategy employed for sequence sampling in active learning, we compared random sampling with a biologically inspired sampling based on directed evolution (DE) [44], whereby new sequences are generated by introducing mutations into sequences from the previous active learning batch. The results (Fig. 2A) suggest that after four active learning loops, directed evolution sampling reaches better optima than random sampling across all levels of NK ruggedness. The final-batch genotype distribution on the NK0 landscape (Fig. 2B) shows that active learning in tandem with directed evolution can effectively find optimal DNA sequences, in agreement with previous studies on protein fitness optimization [30].
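Directed evolution sampling can be sketched as follows. The number of mutated positions per sequence (`n_mut`) and the uniform choice of parents and replacement bases are illustrative assumptions, since the exact settings are not specified in this section.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))

def mutate(seq: str, n_mut: int, rng: np.random.Generator) -> str:
    """Substitute n_mut distinct positions with a different nucleotide."""
    chars = np.array(list(seq))
    positions = rng.choice(len(chars), size=n_mut, replace=False)
    for pos in positions:
        alternatives = ALPHABET[ALPHABET != chars[pos]]
        chars[pos] = rng.choice(alternatives)
    return "".join(chars)

def de_sample(parents, n_samples: int, n_mut: int, rng: np.random.Generator):
    """Generate n_samples mutants, each from a randomly chosen parent sequence."""
    picks = rng.integers(len(parents), size=n_samples)
    return [mutate(parents[i], n_mut, rng) for i in picks]

rng = np.random.default_rng(0)
children = de_sample(["AAAAAAAAAA", "CCCCCCCCCC"], n_samples=5, n_mut=2, rng=rng)
```

Each child differs from its parent at exactly `n_mut` positions, so the sampler stays close to sequences that were already selected in the previous loop.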
Fig. 2.
Optimization of NK genotype-phenotype landscapes with active learning. (A) Optimized expression levels using active learning with two sampling strategies (random, directed evolution). Bars show average expression of the final batch of 100 optimized sequences after four active learning loops, across NK landscapes of increasing ruggedness. (B) Two-dimensional t-SNE representation of the final batch of optimized sequences (black) against the ground truth expression levels (as in Fig. 1C) for the NK0 landscape using directed evolution for sampling. (C) Expression levels for each batch of 100 optimized sequences across the active learning loops for the NK0 landscape using directed evolution for sampling, with different values of the exploration parameter in the Upper Confidence Bound reward function in Eq. (1). (D) Comparison between active learning and one-shot optimization in NK landscapes of increased ruggedness. Plots show optimized expression per batch in each active learning loop using directed evolution for sampling, against several one-shot optimizers (strong-selection weak-mutation, SSWM; random sampling, RS; gradient descent, GD) run on a multilayer perceptron (MLP) regressor. For fair comparison, the one-shot MLP was retrained on the same number of sequences as those employed for the active learning loop. In all plots, dots and whiskers represent the mean and standard error across three replicates with resampled initial training set of 1,000 sequences, one drawn from Latin Hypercube Sampling and two from uniform sampling. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
We also tested the impact of exploration and exploitation in the reward function, as their balance is crucial to ensure a thorough landscape search while maintaining strong optimization performance [45]. We compared the phenotype distribution along the active learning loops in two exploration regimes using directed evolution for sequence sampling (Fig. 2C). A lower exploration-to-exploitation ratio resulted in a higher mean and a lower variance of phenotype, indicating a tendency to refine predictions around already identified high expression regions while limiting broader landscape exploration.
To compare active learning with traditional optimization approaches, we implemented three one-shot optimization strategies based on random screening (RS), strong-selection weak-mutation (SSWM) and gradient descent (GD). The results (Fig. 2D) across four rounds of active learning optimization suggest that active learning can outperform one-shot methods, particularly in complex landscapes with strong epistasis. We observed comparable optimization performance with SSWM in smooth landscapes ($K = 0$), but under higher-order interactions, active learning can produce better optima than one-shot optimizers. Notably, sequences selected through active learning also exhibit a clear trend of performance improvement across training loops, likely due to refinements to the STE model through re-training at each loop. We also observed large batch-to-batch improvement of expression levels with active learning (Supplementary Figure S3). The success of active learning with directed evolution sampling can be attributed to its mutation from sequences that already exhibit satisfactory performance in each round of selection, which also produces STE models with improved accuracy (Supplementary Figures S4–S5). Since epistatic effects in regulatory sequences typically occur within short ranges and contiguous positions in a motif, we further explored active learning on a localized NK landscape, where interactions are restricted to neighboring positions only. The results (Supplementary Figure S6) suggest that the performance of both one-shot and active learning methods is similar to the unconstrained NK model (Fig. 2D).
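For intuition on the one-shot baselines, the sketch below shows an adaptive walk in the spirit of the strong-selection weak-mutation regime: a single random point mutation is proposed at each step and kept only if a fixed predictor scores it higher. This is an assumed, simplified reading of SSWM and not necessarily the implementation used here; `toy_predict` is a hypothetical stand-in for a trained STE model.

```python
import numpy as np

ALPHABET = "ACGT"

def sswm_walk(start: str, predict, n_steps: int, rng: np.random.Generator):
    """Greedy walk on a fixed predictor: accept only improving single mutations."""
    current, best = start, predict(start)
    for _ in range(n_steps):
        pos = int(rng.integers(len(current)))
        base = ALPHABET[int(rng.integers(len(ALPHABET)))]
        candidate = current[:pos] + base + current[pos + 1:]
        score = predict(candidate)
        if score > best:                 # strong selection: never accept a worse variant
            current, best = candidate, score
    return current, best

def toy_predict(seq: str) -> float:
    """Hypothetical fixed predictor: fraction of adenines in the sequence."""
    return seq.count("A") / len(seq)

rng = np.random.default_rng(1)
final_seq, final_score = sswm_walk("CCCCCCCCCC", toy_predict, n_steps=500, rng=rng)
```

Because the predictor is never retrained, the walk can only be as good as the initial model, which is the limitation that the active learning loop addresses by re-training at each round.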
Although the optimization results show that active learning can effectively ascend the expression landscape, we also observed a substantial gap between the achieved expression levels and the true global maximum; this is particularly evident in high ruggedness landscapes such as NK3, where the active learning optimum is 25 % lower than the global maximum (Fig. 2D). We explored whether an increased batch size could help improve the quality of the optimum. To this end, we ran active learning loops on the NK3 landscape with increasing batch sizes (Supplementary Figure S7). We observed a sizeable increase in the optimum when using batches of 1,000 sequences, but such gains were substantially less pronounced for larger batch sizes.
2.2. Optimization of promoter sequences in yeast
To test the utility of active learning in an experimentally characterized expression landscape, we explored the optimization of promoter sequences using a large STE dataset acquired in Saccharomyces cerevisiae [13]. These data include two sets of fixed-length promoter sequences, together comprising more than 20 million variants, alongside expression readouts of a yfp fluorescent reporter in two different growth media. This dataset is ideal for our study because we can subsample the sequence variants to mimic different real world use cases, and test the ability of active learning to learn and transfer information across different experimental conditions (Fig. 3A). To query the landscape with sequences that were not screened in the original work, we employed a transformer-based STE regressor as a surrogate for the expression landscape; this regressor was developed in the original work [13] and achieved a high Pearson correlation on independent test sequences in both growth media.
Fig. 3.
Active learning of promoter sequences in Saccharomyces cerevisiae. (A) We employed a large promoter dataset from Vaishnav et al. [13], including more than 20 M fully randomized promoter sequences measured in two growth media. Transformer-based models can regress the expression landscape with high accuracy (high Pearson correlation in both growth media [13]). (B) Promoter optimization with different batches of initial data; inset shows two subsamples of promoters with uniform (blue) and low expression (orange); as in Fig. 2D, plots show optimized expression levels across the active learning loops, compared to three one-shot optimizers on MLP regressors trained on the same number of sequences. Low expression samples were obtained by first sampling 100,000 sequences, followed by selection of sequences with lowest expression (see Methods). Dots represent the mean and whiskers show the standard error across three replicates with resampled initial training sets as in Fig. 2D. (C) Robustness of active learning to measurement noise. Left: We added Gaussian noise to the expression levels employed to initialize the active learning loop (top) and gradually decreased the correlation with the ground truth data (bottom). Right: Optimized expression levels after four active learning loops with directed evolution for sampling. Expression levels were perturbed with additive Gaussian noise with mean equal to the ground truth and increasing variance $\sigma^2$. Noise was quantified by the coefficient of variation (CV $= \sigma/\mu$) as a measure of dispersion. (D) Active learning of promoter sequences across growth conditions. We initialized the active learning loop with sequences measured in one medium, and ran the active learning optimization using iterative collection of data in a different medium. The use of sequence pre-optimization in the original medium can lead to performance improvements. All active learning results employ directed evolution for sampling promoter sequences.
(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
To mimic data scenarios encountered in applications, we considered a small initial batch of promoters in two relevant cases: one where initial promoters cover a broad range of protein expression levels, and another in which the initial promoters are enriched for low expression (Fig. 3B). This allowed testing the ability of active learning to traverse the landscape from substantially different initial starting points. We compared active learning to one-shot methods using the same total number of evaluated sequences. At each active learning loop, we selected a fixed-size batch of top performers according to the same reward function in Eq. (1) for an ensemble of 10 feedforward neural networks evaluated at 10,000 DE-sampled sequences; we adjusted the exploration/exploitation balance in the reward function to account for the longer sequence length compared to the previous NK landscape. In the case of uniform initial data, we observed (Fig. 3B) clear batch-to-batch improvement with active learning, which outperformed one-shot methods after two active learning loops. In the case of low expression initial data, one-shot methods struggled to escape local optima at intermediate expression levels, while active learning was able to traverse expression levels close to the measured maximum. The expression distributions in Supplementary Figure S8 also show batch-to-batch improvements in expression across the active learning loops. Despite the comparable sequence diversity in both initial batches of sequences, we observed pronounced differences in the space of sequences explored by active learning (Supplementary Figure S9).
Examination of neural network predictions against ground truth along the optimization rounds suggests that in SSWM one-shot optimization, the fixed STE regressor tends to overshoot predictions as compared to the ground truth measurements computed with the surrogate transformer model (Supplementary Figure S10). In contrast, the STE regressor in the active learning loop progressively improves accuracy with each batch. The ability of active learning to iteratively improve model accuracy is particularly noticeable in the low expression scenario, whereby the narrow distribution of training labels leads to an initial regressor with poor generalization (Supplementary Figure S11). By iteratively supplementing the model with improved sequences, active learning can improve predictions and traverse the landscape toward better phenotypes.
Since the choice of regressor can affect optimization performance, we sought to employ convolutional neural networks as an STE model inside the active learning loop. This neural architecture has been shown to have improved inductive bias and to capture promoter motifs that correlate with expression [46]. The results in Supplementary Figure S12A suggest that convolutions can indeed improve the quality of the optimum, as compared to feedforward neural networks. Furthermore, we found that increasing the batch size can also improve the performance of the active learning loop (Supplementary Figure S12B), in agreement with our findings for the synthetic NK landscape (Supplementary Figure S7).
To test the robustness of the active learning approach, we introduced measurement noise to the initial batch of promoters, and thus forced the loop to start from poorer quality data. We gradually controlled the correlation between the initial batch of measurements and ground truth by perturbing expression levels with Gaussian noise with an increasing dispersion, as quantified by the coefficient of variation (Fig. 3C). The results show that, in line with expectation, the performance of active learning decreases for starting batches that are less correlated with the ground truth, which in turn increases the number of loops required to achieve a given expression level. However, after four loops the optimizer can effectively recover performance, achieving optima of comparable quality across all noise levels tested.
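The noise perturbation can be sketched as below. The sketch assumes that the noise standard deviation is set as the coefficient of variation times the mean expression of the batch; how the mean enters the definition is an assumption here, and the toy expression values are hypothetical.

```python
import numpy as np

def add_measurement_noise(y: np.ndarray, cv: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation cv * mean(y)."""
    sigma = cv * np.mean(y)   # assumed link between CV and noise level
    return y + rng.normal(0.0, sigma, size=y.shape)

rng = np.random.default_rng(0)
y_true = rng.uniform(5.0, 15.0, size=5_000)   # hypothetical expression levels

for cv in (0.1, 0.5, 2.0):
    y_noisy = add_measurement_noise(y_true, cv, rng)
    corr = np.corrcoef(y_true, y_noisy)[0, 1]
    print(f"CV={cv}: Pearson correlation with ground truth = {corr:.2f}")
```

Larger values of `cv` lower the correlation between the noisy initial batch and the ground truth, which is the quantity controlled in this robustness test.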
We sought to examine this robustness in more detail by simulating a data transfer scenario, whereby the loop is initialized with data acquired under different experimental conditions. This mimics use cases in which data from other laboratories is used to start an active learning pipeline. We initialized the sequence optimizer with promoters measured in growth medium B, and ran an active learning loop using the Transformer network pre-trained on medium A as a surrogate for new measurements. The results (Fig. 3D) show that the active learning loop can achieve similar expression levels for both initializations, likely because of the high correlation between initial batch sequences in the two growth conditions (Supplementary Figure S13). Furthermore, when initialized with sequences pre-optimized in medium B, the active learning loop was able to reach a further 7.4 % improvement in expression in the final batch in medium A.
2.3. Performance improvement with alternative sampling and selection
In this section, we explore further performance improvements via other strategies for sequence sampling and selection on the yeast promoter dataset. We first trialed alternatives to directed evolution sampling that aim to mimic other aspects of natural evolution. Specifically, we explored two alternative ways to introduce mutations in the sampled sequences (Fig. 4A, see Methods for details):
• Genetic drift: In this approach, new query sequences are designed by assigning a probability of mutation to each site of a sequence picked at random from the previous active learning loop. Each sequence was generated with a 10 % probability of mutation from a random sequence of the batch in the previous loop.
• Recombination: We break down the sequences from the querying data of the latest learning loop, and recombine them randomly. In each loop, 10 breakpoints were randomly introduced, and the sequences from the previous loop were broken at these points and recombined.
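The two sampling schemes can be sketched as follows. Details such as drawing a separate set of breakpoints for each child, choosing a random donor per segment, and always substituting a different base are assumptions made for illustration; the precise procedure is described in the Methods.

```python
import numpy as np

ALPHABET = "ACGT"

def genetic_drift(parents, n_samples: int, p_mut: float, rng: np.random.Generator):
    """Mutate each site of a randomly picked parent with probability p_mut."""
    children = []
    for _ in range(n_samples):
        chars = list(parents[int(rng.integers(len(parents)))])
        for pos, base in enumerate(chars):
            if rng.random() < p_mut:
                alternatives = [b for b in ALPHABET if b != base]
                chars[pos] = alternatives[int(rng.integers(len(alternatives)))]
        children.append("".join(chars))
    return children

def recombine(parents, n_samples: int, n_breaks: int, rng: np.random.Generator):
    """Cut at n_breaks random points and fill each segment from a random parent."""
    length = len(parents[0])
    children = []
    for _ in range(n_samples):
        cuts = np.sort(rng.choice(np.arange(1, length), size=n_breaks, replace=False))
        bounds = [0, *cuts.tolist(), length]
        child = ""
        for start, end in zip(bounds[:-1], bounds[1:]):
            donor = parents[int(rng.integers(len(parents)))]
            child += donor[start:end]
        children.append(child)
    return children

rng = np.random.default_rng(0)
batch = ["A" * 20, "C" * 20, "G" * 20]           # hypothetical previous batch
drifted = genetic_drift(batch, n_samples=4, p_mut=0.1, rng=rng)
recombined = recombine(batch, n_samples=4, n_breaks=10, rng=rng)
```

Recombination introduces no new nucleotides at any position; it only reshuffles segments already present in the previous batch, whereas genetic drift introduces new point mutations.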
Fig. 4.
Improving active learning performance with alternative methods for sequence sampling and selection. (A) Two additional sequence sampling methods: genetic drift and recombination. (B) Comparison of active learning performance for different sequence sampling methods, initialized with a Latin Hypercube Sampling batch. (C) Enrichment of promoter motifs across the active learning loops for different sequence sampling strategies. Shown are average position frequency matrices (PFM) scores of 244 motifs [47] for each batch across the active learning loops; values were normalized to the PFM scores of the initial batch. (D) Comparison of active learning with directed evolution for sampling and selection based on a reward function with and without motif weighting, as in Eq. (3); error bars denote replicates across three initial batches of promoters.
To evaluate these sampling strategies, 10,000 sequences were sampled in each learning loop with each of the three biological sampling methods (DE, genetic drift and recombination), and the top 1,000 were selected based on the reward function in Eq. (1). Fig. 4B shows the performance of the sampling methods across four active learning loops. The results show that the sampling method has a substantial impact on the active learning loop. While genetic drift performs slightly worse than DE, recombination outperforms DE, highlighting further potential for improving the active learning pipeline. In all three methods we observed a decreased but reasonable sequence diversity along the active learning loops (Supplementary Figure S9, Supplementary Figure S14).
We also observed motif enrichment in the batches of sequences sampled by directed evolution, genetic drift and recombination in the active learning pipeline. These motifs are known to bind to specific transcription factors [47]. For each AL loop, we calculated the average motif scores across the selected batch of 1000 sequences for the 244 relevant transcription factor motifs, and compared them with the average motif score in the initial batch, which has no bias towards any motif (Fig. 4C). Certain motifs were consistently enriched or depleted across all four batches and sampling methods, which suggests that active learning can capture biologically-relevant sequence features.
As another strategy to improve the active learning loop, we explored introducing task-specific domain knowledge into the sequence selection step. Previous work has shown that mechanistic information on transcription factor binding has strong correlations with expression levels [48] and helps with model generalization to new regions of the sequence space [49]. We therefore modified the reward function by including a weight on a motif score (see Eq. (3) in Methods), which accounts for the presence of sequence motifs known to bind to specific transcription factors and was mined from the YeTFaSCo database [47]. This strategy is motivated by previous work showing that most of the motifs are activators rather than repressors [48], so that maximization of the modified reward function can steer the search towards sequences enriched with relevant motifs and improved expression. The results (Fig. 4D) show that motif weighting improves active learning performance, indicating the effectiveness of including task-specific biological knowledge in the learning loop.
3. Discussion
Many applications in biotechnology and biomedicine require optimization of heterologous protein expression. One strategy is to design an expression system with highly performant regulatory DNA elements, such as promoters, enhancers or terminators, without modifications to the coding sequence. Here, we demonstrate the use of active learning as a computational strategy to find regulatory sequences that improve protein expression. Using both synthetic and experimentally determined expression landscapes, our results suggest that active learning can find optimal sequences in complex expression landscapes, thanks to its ability to iteratively refine phenotypic predictions.
Active learning has been widely applied to diverse biological tasks [31], [32], [34], [35], [36], [37], [38], [39]. In the case of protein sequence design, recent studies have demonstrated that directed evolution paired with active learning can efficiently navigate protein fitness landscapes to improve enzyme function [29], [30]. Such approaches rely on iterative mutation of specific residues that are expected to impact function, such as enzymatic active sites or specific transcription factor binding motifs. In the case of regulatory DNA designed to improve expression levels, however, current experimental approaches increasingly rely on large libraries of fully randomized sequences using massively parallel reporter assays [11], [50], [51]. While this enables broader coverage of the sequence space [52], it also introduces additional challenges resulting from the ruggedness of the phenotypic landscape. Such expression fitness landscapes can have many local maxima; for example, in the case of promoter sequences such maxima may cluster around sequence regions enriched for specific transcriptional motifs.
To explore the performance of active learning on highly non-convex landscapes, we first focussed on synthetic data generated via the NK model of epistatic interactions. We show that active learning can effectively navigate such complex landscapes, which would otherwise be a substantial challenge with traditional hill climbing algorithms. We further trialed active learning on a large promoter expression dataset, using a pre-trained deep learning model [13] as a surrogate for unmeasured sequences. Previous studies have explored the transfer of data across sequence-to-expression predictor models by pre-training on different experimental conditions [53], indicating that cross-condition data can provide useful information. By swapping initial data from two growth conditions, we show that active learning can robustly find optimal sequences when initialized with expression data from different experimental conditions. Notably, we found that initializing the active learning loop with pre-optimized sequences from another experimental condition led to better expression than starting from non-optimized samples from the same condition. This suggests that active learning optimization can make effective use of data acquired in different growth conditions or laboratories.
In our implementations we compared various strategies for sampling and selection of DNA sequences along the active learning loop. However, other alternatives including Bayesian optimization [54] and generative models [26], [55] could further improve performance. Moreover, in our analysis we focused mostly on feedforward neural networks as sequence-to-expression predictors, and further studies are needed to explore the advantages of various deep learning architectures that have shown high predictive power [56]. These and other extensions offer substantial promise for the use of active learning in DNA sequence optimization.
4. Methods
4.1. Protein expression datasets
Synthetic expression landscapes
To simulate synthetic genotype–phenotype landscapes with a controllable level of ruggedness, we adapted the classic NK fitness model [42] to nucleotide sequences; $N$ represents the length of the DNA sequence and $K$ defines the order of epistatic interactions among positions. The parameter $K$ ranges from 0 to $N-1$ and controls the ruggedness of the landscape: higher values introduce higher-order epistatic interactions, increasing the ruggedness and number of local optima in the fitness landscape.
To create the NK landscape, we employed a previous Python implementation [57] based on the code available at https://github.com/acmater/NK_Benchmarking. We first constructed an epistatic interaction network in which each nucleotide position interacts with $K$ other positions in the sequence. Specifically, for each of the $N$ nucleotide positions, $K$ other positions were randomly selected from a uniform distribution to form an interaction matrix. When $K=0$, the interaction matrix reduces to an $N \times 1$ vector. For each row, we enumerated all $4^{K+1}$ nucleotide combinations; each combination was assigned a fitness value sampled from a uniform distribution in the range. This leads to $4^{K+1}$ fitness scores (one per epistatic interaction) for each of the $N$ positions in the sequence. To compute the fitness of a specific sequence, at each position we looked up the fitness score associated with the corresponding row in the interaction matrix, and then averaged across all positions to obtain a single fitness for the whole sequence. The localized NK landscape was created similarly, but each nucleotide was allowed to interact only with adjacent nucleotides, which produced a constrained interaction matrix. The assignment of fitness scores and the calculation of phenotypes then followed the same process as the standard NK landscape.
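The construction above can be illustrated with a minimal Python sketch. This is not the implementation from [57]; the function names (`make_nk_landscape`, `nk_fitness`), the uniform range [0, 1), and the choice of the next K positions (with wrap-around) as "adjacent" in the localized variant are assumptions made for illustration only.

```python
import numpy as np

ALPHABET = "ACGT"

def make_nk_landscape(n, k, rng, localized=False):
    """Return the interaction matrix (n x (k+1)) and per-position fitness tables."""
    rows = []
    for i in range(n):
        if localized:
            # Assumption: neighbours are the k positions following i, wrapping around.
            others = [(i + j) % n for j in range(1, k + 1)]
        else:
            # k distinct partner positions drawn uniformly at random.
            candidates = [p for p in range(n) if p != i]
            others = rng.permutation(candidates)[:k].tolist()
        rows.append([i] + others)
    # One table per position with 4**(k+1) entries, one per nucleotide combination.
    # Assumption: fitness contributions are drawn from U[0, 1).
    tables = rng.uniform(0.0, 1.0, size=(n, 4 ** (k + 1)))
    return np.array(rows, dtype=int), tables

def nk_fitness(seq, rows, tables):
    """Average the table entries selected by the nucleotides in each interaction row."""
    idx = np.array([ALPHABET.index(c) for c in seq])
    powers = 4 ** np.arange(rows.shape[1])
    keys = (idx[rows] * powers).sum(axis=1)  # base-4 code of each row's nucleotides
    return tables[np.arange(len(seq)), keys].mean()
```

With `k=0` each position contributes independently, so the landscape is additive; larger `k` couples more positions and makes the landscape more rugged.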
Yeast promoter expression data
The promoter sequence ground truth models were taken from previous work by Vaishnav et al. [13]. The original data contain yellow fluorescent protein (YFP) expression levels of 80-nt promoter sequences under two different experimental conditions: defined medium (M sequences) and complex medium (M sequences). Fluorescence measurements are in the range of [0, 18] across the whole dataset. In the original work, a separate transformer-based predictor was trained for each experimental condition, achieving high prediction accuracy (Pearson % in both cases). We employed these unmodified pre-trained models as surrogates for experimental measurements in our active learning loops.
4.2. Model evaluation
Model performance on the NK landscapes was scored with the coefficient of determination ($R^2$) between predicted expression and ground truth:

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2} \qquad (2)$$

where $y_i$ and $\hat{y}_i$ are the measurement and prediction on the test set, respectively, and $\bar{y}$ is the average expression measured. The $R^2$ score is exactly zero for the naive regressor, i.e. a model that predicts average expression in the test data for all variants in the test set, and negative values indicate a flawed model structure that is worse than predicting the average observation.
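As a brief illustration of this score, the following Python sketch computes the coefficient of determination directly from its definition (one minus the ratio of the residual to the total sum of squares). The function name `r2_score` is ours and is not taken from the published code.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination between measurements and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```

Predicting the test-set mean for every variant makes the two sums equal, which gives a score of zero; any predictor with a larger residual sum than that yields a negative score.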
4.3. Active learning loop
Supervised learning of expression landscapes
We employed an ensemble of feedforward neural networks (multilayer perceptrons, MLPs) to regress expression levels. The ensemble provides both a prediction of expression and an estimate of its uncertainty, which are used for sequence sampling and selection. The ensemble consists of the top 10 performing models from a pool of 40 MLPs spanning four architectures of variable complexity and 10 random weight initializations. Model architectures are detailed in Table S1; all models were trained with the Adam optimizer and a constant learning rate of 0.001. The mean and standard deviation of the expression predicted by the 10 MLP models were used in the reward function to balance exploration and exploitation during sequence selection.
Sequence sampling and selection via directed evolution
In each active learning loop, 10,000 sequences were generated by sampling with replacement from the top 100 sequences of the previous batch; random nucleotide substitutions were then applied at 4 (NK landscape) or 10 (promoter landscape) randomly chosen positions. Sequences were evaluated with the reward function, and the top 100 sequences on the NK landscapes or the top 1,000 sequences on the promoter landscapes were selected for the next active learning loop.
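The sampling step can be sketched in Python as follows. The function name `propose_sequences` is hypothetical, and we assume that a substitution may redraw the nucleotide already present at a position, a detail not specified above.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))

def propose_sequences(parents, n_samples, n_mut, rng):
    """Resample parents with replacement and substitute n_mut random positions in each copy."""
    parents = np.array([list(p) for p in parents])
    picks = parents[rng.integers(len(parents), size=n_samples)].copy()
    length = picks.shape[1]
    for child in picks:  # each row is a view, so edits modify `picks` in place
        positions = rng.choice(length, size=n_mut, replace=False)
        # Assumption: the new nucleotide is drawn uniformly and may equal the old one.
        child[positions] = rng.choice(ALPHABET, size=n_mut)
    return ["".join(c) for c in picks]
```

For the NK landscapes described above, this would be called with the top 100 sequences as `parents`, `n_samples=10_000` and `n_mut=4`.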
Reward function
To score and select sequences for the next active learning loop, we employed the Upper Confidence Bound (UCB) reward function [45] specified in Eq. (1). To determine an appropriate value for the exploration parameter , we performed a grid search with a step size of 0.1 on the NK0 landscape (Supplementary Figure S2). The optimal value was found to be which was used consistently across the four NK landscapes. For the promoter datasets, we scaled by a factor based on the sequence length difference ().
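A minimal sketch of UCB-based scoring with an ensemble is given below. It assumes the common form of the reward, namely the ensemble mean plus a weighted ensemble standard deviation; the parameter name `beta`, the use of the population standard deviation, and the helper `select_top` are illustrative assumptions rather than details of the published implementation.

```python
import numpy as np

def ucb_reward(ensemble_preds, beta):
    """UCB score per sequence from predictions of shape (n_models, n_sequences)."""
    preds = np.asarray(ensemble_preds, dtype=float)
    # Mean rewards exploitation; the spread across models rewards exploration.
    return preds.mean(axis=0) + beta * preds.std(axis=0)

def select_top(sequences, rewards, n_select):
    """Return the n_select sequences with the highest reward, best first."""
    order = np.argsort(rewards)[::-1][:n_select]
    return [sequences[i] for i in order]
```

Setting `beta` to zero reduces the selection to pure exploitation of the mean prediction, whereas larger values favour sequences on which the ensemble members disagree.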
Implementation
Our implementation follows the pseudocode below:
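The overall structure of the loop can be outlined in Python as a non-authoritative sketch. The four callables (`measure`, `fit`, `propose`, `reward`) are placeholders for the measurement, model training, sampling and scoring steps described in this section; retraining on all data accumulated so far is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

def active_learning_loop(
    initial: Sequence[str],
    measure: Callable[[Sequence[str]], List[float]],
    fit: Callable[[Sequence[str], Sequence[float]], object],
    propose: Callable[[Sequence[str]], List[str]],
    reward: Callable[[object, Sequence[str]], List[float]],
    n_loops: int,
    batch_size: int,
) -> Tuple[List[str], List[float]]:
    """Iterate measurement, model training, and sequence sampling-and-selection."""
    seqs = list(initial)
    labels = list(measure(seqs))  # initial measurements
    batch, batch_labels = seqs, labels
    for _ in range(n_loops):
        # Assumption: the model is retrained on all data collected so far.
        model = fit(seqs, labels)
        # Rank the previous batch by measured expression, best first.
        ranked = sorted(zip(batch_labels, batch), reverse=True)
        candidates = propose([s for _, s in ranked])
        scores = reward(model, candidates)
        order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
        batch = [candidates[i] for i in order[:batch_size]]
        batch_labels = list(measure(batch))  # new round of measurements
        seqs = seqs + batch
        labels = labels + batch_labels
    return seqs, labels
```

Keeping the four steps as arguments makes it straightforward to swap the surrogate ground truth, the regressor, or the reward function without changing the loop itself.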
4.4. One-shot optimization
For fair comparisons with active learning in Figs. 2 and 3, one-shot optimizations were computed using MLP regressors trained on the same number of sequences as in each active learning loop. The MLP architecture was selected with 5-fold cross-validation across the same architectures in Table S1 and trained with the Adam optimizer and a constant learning rate of 0.001. Three methods were applied consistently across the NK landscapes and the promoter sequence landscapes, as detailed next.
Random sampling (RS)
For RS, 100,000 sequences were randomly generated and scored using the MLP model. The top 100 sequences based on predicted expression were selected.
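A short Python sketch of this baseline is shown below. Here `predict` stands in for the trained MLP and is assumed to map a sequence string to a scalar score; the function name `random_sampling` is ours.

```python
import numpy as np

def random_sampling(predict, length, n_samples, n_top, rng):
    """Score uniformly random sequences and keep the n_top best by predicted expression."""
    idx = rng.integers(4, size=(n_samples, length))
    seqs = ["".join("ACGT"[i] for i in row) for row in idx]
    scores = np.array([predict(s) for s in seqs])
    top = np.argsort(scores)[::-1][:n_top]  # indices sorted by descending score
    return [seqs[i] for i in top]
```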
Gradient descent (GD)
For GD, sequences were optimized over 200 iterations to achieve a higher predicted expression. Sequences were represented in one-hot encoding and treated as continuous-valued vectors during the optimization. Nucleotides were optimized in the continuous space, and projected back to the nucleotide space by selecting the nucleotide with the maximum score (argmax) at each position at the end of the entire gradient-descent optimization process. At each step, MLP model gradients were estimated via finite differences and used to update the sequence in the direction that improved the model output, and the final sequence after 200 iterations was selected. The full procedure was repeated independently for 100 random starting sequences to generate 100 optimized sequences.
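The procedure can be sketched in Python as follows. The step size, the perturbation size `eps`, and the use of central differences are assumptions, and `predict` is a placeholder for the trained MLP acting on a relaxed one-hot matrix. Because the update increases the predicted expression, each step is an ascent step on the model output.

```python
import numpy as np

def finite_difference_ascent(predict, x0, n_iter=200, step=0.1, eps=1e-3):
    """Optimize a relaxed one-hot matrix of shape (length, 4), then project with argmax."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        grad = np.zeros_like(x)
        for idx in np.ndindex(x.shape):
            bump = np.zeros_like(x)
            bump[idx] = eps
            # Central finite-difference estimate of the partial derivative.
            grad[idx] = (predict(x + bump) - predict(x - bump)) / (2 * eps)
        x += step * grad  # move in the direction that increases the prediction
    # Projection to nucleotide space only after the final iteration.
    return "".join("ACGT"[i] for i in x.argmax(axis=1))
```

Each iteration requires two model evaluations per matrix entry, so the cost grows linearly with sequence length.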
Strong-selection weak-mutation (SSWM)
For SSWM, sequences were subjected to 200 generations (NK landscapes) or 100 generations (promoter landscapes) of evolution and selection to reach a higher predicted expression. In each generation, mutations were introduced at a low rate in a small population of 10 sequences, and the mutated sequences were evaluated using the MLP model. The next generation consisted of the top 5 sequences and 5 single-mutant variants of the current best sequence. As with GD, the full procedure was repeated 100 times to generate 100 optimized sequences.
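A minimal Python sketch of this scheme follows. The per-position mutation rate (`rate=0.02`) is a hypothetical value, since only a "low rate" is specified above, and `predict` is a placeholder for the trained MLP.

```python
import numpy as np

ALPHABET = "ACGT"

def mutate(seq, rate, rng):
    """Redraw each position independently with probability `rate`."""
    chars = list(seq)
    for i in range(len(chars)):
        if rng.random() < rate:
            chars[i] = ALPHABET[rng.integers(4)]
    return "".join(chars)

def single_mutant(seq, rng):
    """Substitute exactly one position with a different nucleotide."""
    i = rng.integers(len(seq))
    choices = [c for c in ALPHABET if c != seq[i]]
    return seq[:i] + choices[rng.integers(3)] + seq[i + 1:]

def sswm(predict, start, n_gen, rng, rate=0.02, pop_size=10, n_keep=5):
    """Evolve a small population under strong selection and weak mutation."""
    pop = [start] * pop_size
    for _ in range(n_gen):
        mutated = [mutate(s, rate, rng) for s in pop]
        ranked = sorted(mutated, key=predict, reverse=True)
        best = ranked[0]
        # Next generation: top sequences plus single mutants of the current best.
        pop = ranked[:n_keep] + [single_mutant(best, rng) for _ in range(pop_size - n_keep)]
    return max(pop, key=predict)
```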
4.5. Motif enrichment and motif sampling
For the results in Fig. 4D, we included an additional motif score in the reward function:
| (3) |
where is the number of MLP predictors in the ensemble, the parameter controls the balance between sequence exploration and exploitation, is the sum of the motif score, and is a penalty on the motif score; we employed parameters and . The motif score was computed from predicted binding probabilities between transcription factors (TFs) and sequence motifs based on earlier work [49]. The 244 position frequency matrices (PFMs) were obtained from the YeTFaSCo database [47], which provides the occurrences of each nucleotide at each position of 244 TF-specific motifs. Each PFM has dimensions of $4 \times L$, where $L$ is the motif length. For each promoter sequence, we computed the binding probability for each motif by integrating over all sequence positions:
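$$P_{\mathrm{bind}} = 1 - \prod_{j} \left(1 - p_j\right) \qquad (4)$$

with $p_j$ being the probability of TF binding starting from the $j$-th nucleotide in the 80-nucleotide sequence, derived from the corresponding PFM, and the product running over all start positions. The final value $P_{\mathrm{bind}}$ is the probability that the TF binds to at least one site within the entire sequence.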
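The calculation can be illustrated with the Python sketch below. The probability of binding to at least one site is computed as one minus the product of the per-site non-binding probabilities. How the per-site probability is derived from the PFM is an assumption here: we normalize each PFM column to frequencies (with a pseudocount) and multiply the frequencies of the observed nucleotides, which may differ from the procedure in [49].

```python
import numpy as np

NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def site_probabilities(seq, pfm, pseudocount=1.0):
    """Per-start-position binding probabilities from a PFM of shape (4, motif length)."""
    pfm = np.asarray(pfm, dtype=float) + pseudocount
    ppm = pfm / pfm.sum(axis=0, keepdims=True)  # column-normalized frequencies
    width = ppm.shape[1]
    idx = [NUC_INDEX[c] for c in seq]
    # Assumption: site probability is the product of matched column frequencies.
    return np.array([
        np.prod(ppm[idx[j:j + width], np.arange(width)])
        for j in range(len(seq) - width + 1)
    ])

def binding_probability(seq, pfm):
    """Probability that the TF binds to at least one site in the sequence."""
    p = site_probabilities(seq, pfm)
    return 1.0 - np.prod(1.0 - p)
```

Summing `binding_probability` over all motifs would then give a sequence-level motif score of the kind used in the reward function.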
4.6. Data transfer across growth conditions
For our data transfer analysis (Fig. 3D), we employed the two pre-trained transformer models (medium A, medium B) from the original data source [13] as surrogates for ground truth data. For the data transfer across media, we first initialized the active learning loop with 1,000 randomly sampled sequences measured in medium B. We then ran the active learning optimization on medium A using iterative collection of data from measurements in that medium. For data transfer with pre-optimization, we followed the same process but pre-optimized the 1,000 initial sequences for medium B, and then ran the active learning loop for medium A. The pre-optimization process consists of a separate round of four active learning loops on medium B, and the final batch of 1,000 sequences with the highest expression in medium B was employed to initialize the loop in medium A. The benchmark (Fig. 3D, blue) is the same as in Fig. 3B (top).
CRediT authorship contribution statement
Yuxin Shen: Writing – original draft, Visualization, Methodology, Investigation, Data curation. Grzegorz Kudla: Supervision, Conceptualization. Diego A. Oyarzún: Writing – review & editing, Visualization, Supervision, Methodology, Investigation, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
Y.S. was supported by the UKRI Biotechnology and Biological Sciences Research Council (BBSRC) grant number BB/T00875X/1. G.K. was supported by the UK Medical Research Council (MRC) University Unit programme (MC_UU_00035/8) and a Wellcome Trust fellowship (207507).
Code availability
Python code for model training and evaluation has been deposited on Zenodo [58] at https://zenodo.org/doi/10.5281/zenodo.17190717.
Footnotes
Supplementary data for this article can be found online at doi:10.1016/j.csbj.2025.09.033.
Appendix A. Supplementary data
References
- 1. Clomburg J.M., Crumbley A.M., Gonzalez R. Industrial biomanufacturing: the future of chemical production. Science. 2017;355(6320) doi: 10.1126/science.aag0804.
- 2. Cazier A.P., Blazeck J. Advances in promoter engineering: novel applications and predefined transcriptional control. Biotechnol J. 2021;16(10) doi: 10.1002/biot.202100239.
- 3. Konkle B.A., Walsh C.E., Escobar M.A., et al. BAX 335 hemophilia B gene therapy clinical trial results: potential impact of CpG sequences on gene expression. Blood J Am Soc Hematol. 2021;137(6):763–774. doi: 10.1182/blood.2019004625.
- 4. Castillo-Hair S., Fedak S., Wang B., et al. Optimizing 5’UTRs for mRNA-delivered gene editing using deep learning. Nat Commun. 2024 Jun;15(1):5284. doi: 10.1038/s41467-024-49508-2.
- 5. Greenbury S.F., Louis A.A., Ahnert S.E. The structure of genotype-phenotype maps makes fitness landscapes navigable. Nat Ecol Evol. 2022 Nov;6(11):1742–1752. doi: 10.1038/s41559-022-01867-z.
- 6. Gordon M.G., Inoue F., Martin B., et al. lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements. Nat Protoc. 2020;15(8):2387–2412. doi: 10.1038/s41596-020-0333-5.
- 7. Kreimer A., Zeng H., Edwards M.D., et al. Predicting gene expression in massively parallel reporter assays: a comparative study. Hum Mutat. 2017;38(9):1240–1250. doi: 10.1002/humu.23197.
- 8. Gilliot P.-A., Gorochowski T.E. Sequencing enabling design and learning in synthetic biology. Curr Opin Chem Biol. 2020;58:54–62. doi: 10.1016/j.cbpa.2020.06.002.
- 9. Baranowski C., Martin H.G., Oyarzún D.A., et al. Can protein expression be ‘solved’? Trends Biotechnol. 2025 Jun. doi: 10.1016/j.tibtech.2025.04.021.
- 10. Shen Y., Underhill J., Mulholland A.J., Oyarzún D.A., Curnow P. Effective sequence-to-expression prediction for membrane proteins using machine learning and computational protein design. bioRxiv:2025.09.25.678317.
- 11. Nikolados E.-M., Oyarzún D.A. Deep learning for optimization of protein expression. Curr Opin Biotechnol. 2023 doi: 10.1016/j.copbio.2023.102941.
- 12. Höllerer S., Papaxanthos L., Gumpinger A.C., et al. Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping. Nat Commun. 2020 Dec;11(1):3551. doi: 10.1038/s41467-020-17222-4.
- 13. Vaishnav E.D., de Boer C.G., Molinet J., et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature. 2022 Mar;603(7901):455–463. doi: 10.1038/s41586-022-04506-6.
- 14. Seo E., Choi Y.-N., Shin Y.R., et al. Design of synthetic promoters for cyanobacteria with generative deep-learning model. Nucleic Acids Res. 2023;51(13):7071–7082. doi: 10.1093/nar/gkad451.
- 15. de Almeida B.P., Reiter F., Pagani M., Stark A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet. 2022;54(5):613–624. doi: 10.1038/s41588-022-01048-5.
- 16. Gosai S.J., Castro R.I., Fuentes N., et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature. 2024;634(8036):1211–1220. doi: 10.1038/s41586-024-08070-z.
- 17. Sample P.J., Wang B., Reid D.W., et al. Human 5’ UTR design and variant effect prediction from a massively parallel translation assay. Nat Biotechnol. 2019;37(7):803–809. doi: 10.1038/s41587-019-0164-5.
- 18. Nikolados E.-M., Wongprommoon A., Aodha O.M., et al. Accuracy and data efficiency in deep learning models of protein expression. Nat Commun. 2022 Dec;13(1):7755. doi: 10.1038/s41467-022-34902-5.
- 19. Linder J., Seelig G. Fast activation maximization for molecular sequence design. BMC Bioinformatics. 2021 Dec;22(1):510. doi: 10.1186/s12859-021-04437-5.
- 20. Sinai S., Kelsic E.D. A primer on model-guided exploration of fitness landscapes for biological sequence design. 2020 Oct. arXiv:2010.10614.
- 21. Angenent-Mari N.M., Garruss A.S., Soenksen L.R., et al. A deep learning approach to programmable RNA switches. Nat Commun. 2020;11(1):5057. doi: 10.1038/s41467-020-18677-1.
- 22. Kotopka B.J., Smolke C.D. Model-driven generation of artificial yeast promoters. Nat Commun. 2020 Dec;11(1):2113. doi: 10.1038/s41467-020-15977-4.
- 23. de Almeida B.P., Schaub C., Pagani M., et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature. 2024;626(7997):207–211. doi: 10.1038/s41586-023-06905-9.
- 24. Taskiran I.I., Spanier K.I., Dickmänken H., et al. Cell-type-directed design of synthetic enhancers. Nature. 2024;626(7997):212–220. doi: 10.1038/s41586-023-06936-2.
- 25. Angermueller C., Belanger D., Gane A., et al. Population-based black-box optimization for biological sequence design. In: Proceedings of the 37th International Conference on Machine Learning. PMLR; Nov 2020. pp. 324–334.
- 26. Linder J., Bogard N., Rosenberg A.B., Seelig G. A generative neural network for maximizing fitness and diversity of synthetic DNA and protein sequences. Cell Syst. 2020 Jul;11(1):49–62.e16. doi: 10.1016/j.cels.2020.05.007.
- 27. Killoran N., Lee L.J., Delong A., et al. Generating and designing DNA with deep generative models. 2017 Dec.
- 28. Settles B. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences. 2009.
- 29. Thornton E.L., Boyle J.T., Laohakunakorn N., Regan L. Cell-free protein synthesis as a method to rapidly screen machine learning-generated protease variants. ACS Synth Biol. 2025;14(25) doi: 10.1021/acssynbio.5c00062.
- 30. Yang J., Lal R.G., Bowden J.C., et al. Active learning-assisted directed evolution. Nat Commun. 2025;16(1):714. doi: 10.1038/s41467-025-55987-8.
- 31. Reker D., Schneider G. Active-learning strategies in computer-assisted drug discovery. Drug Discov Today. 2015 Apr;20(4):458–465. doi: 10.1016/j.drudis.2014.12.004.
- 32. Rodríguez-Pérez R., Miljković F., Bajorath J. Assessing the information content of structural and protein–ligand interaction representations for the classification of kinase inhibitor binding modes via machine learning and active learning. J Cheminform. 2020 May;12(1):36. doi: 10.1186/s13321-020-00434-7.
- 33. Hie B., Bryson B.D., Berger B. Leveraging uncertainty in machine learning accelerates biological discovery and design. Cell Syst. 2020;11(5):461–477. doi: 10.1016/j.cels.2020.09.007.
- 34. Borkowski O., Koch M., Zettor A., et al. Large scale active-learning-guided exploration for in vitro protein production optimization. Nat Commun. 2020 Apr;11(1):1872. doi: 10.1038/s41467-020-15798-5.
- 35. Albornoz R.V., Oyarzún D.A., Burgess K. Optimisation of surfactin yield in Bacillus using data-efficient active learning and high-throughput mass spectrometry. Comput Struct Biotechnol J. 2024 Dec;23:1226–1233. doi: 10.1016/j.csbj.2024.02.012.
- 36. Zournas A., Incha M.R., Radivojevic T., et al. Machine learning-led semi-automated medium optimization reveals salt as key for flaviolin production in Pseudomonas putida. Commun Biol. 2025 Apr;8(1):1–14. doi: 10.1038/s42003-025-08039-2.
- 37. HamediRad M., Chao R., Weisberg S., et al. Towards a fully automated algorithm driven platform for biosystems design. Nat Commun. 2019 Nov;10(1):5150. doi: 10.1038/s41467-019-13189-z.
- 38. Kumar P., Adamczyk P.A., Zhang X., et al. Active and machine learning-based approaches to rapidly enhance microbial chemical production. Metab Eng. 2021 Sep;67:216–226. doi: 10.1016/j.ymben.2021.06.009.
- 39. Pandi A., Diehl C., Yazdizadeh Kharrazi A., et al. A versatile active learning workflow for optimization of genetic and metabolic networks. Nat Commun. 2022 Jul;13(1):3876. doi: 10.1038/s41467-022-31245-z.
- 40. Friedman R.Z., Ramu A., Lichtarge S., et al. Active learning of enhancers and silencers in the developing neural retina. Cell Syst. 2025;16(1) doi: 10.1016/j.cels.2024.12.004.
- 41. Yin C., Castillo-Hair S., Byeon G.W., et al. Iterative deep learning design of human enhancers exploits condensed sequence grammar to achieve cell-type specificity. Cell Syst. 2025;16(7) doi: 10.1016/j.cels.2025.101302.
- 42. Kauffman S.A., Weinberger E.D. The NK model of rugged fitness landscapes and its application to maturation of the immune response. J Theor Biol. 1989;141(2):211–245. doi: 10.1016/s0022-5193(89)80019-0.
- 43. Obolski U., Ram Y., Hadany L. Key issues review: evolution on rugged adaptive landscapes. Rep Prog Phys. 2017;81(1) doi: 10.1088/1361-6633/aa94d4.
- 44. Wang Y., Xue P., Cao M., et al. Directed evolution: methodologies and applications. Chem Rev. 2021;121(20):12384–12444. doi: 10.1021/acs.chemrev.1c00260.
- 45. Sutton R.S., Barto A.G. Reinforcement learning: an introduction. MIT Press; Cambridge, Mass: 1998. (Adaptive Computation and Machine Learning).
- 46. Kelley D.R., Snoek J., Rinn J.L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26(7):990–999. doi: 10.1101/gr.200535.115.
- 47. de Boer C.G., Hughes T.R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 2012;40(D1):169–179. doi: 10.1093/nar/gkr993.
- 48. de Boer C.G., Vaishnav E.D., Sadeh R., et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol. 2020;38(1):56–65. doi: 10.1038/s41587-019-0315-8.
- 49. Shen Y., Kudla G., Oyarzún D.A. Improving the generalization of protein expression models with mechanistic sequence information. Nucleic Acids Res. 2025;53(3) doi: 10.1093/nar/gkaf020.
- 50. La Fleur A., Shi Y., Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev. 2024;38(17–20):843–865. doi: 10.1101/gad.351800.124.
- 51. Zahm A.M., Owens W.S., Himes S.R., et al. A massively parallel reporter assay library to screen short synthetic promoters in mammalian cells. Nat Commun. 2024 Nov;15(1) doi: 10.1038/s41467-024-54502-9.
- 52. de Boer C.G., Vaishnav E.D., Sadeh R., et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol. 2020 Jan;38(1):56–65. doi: 10.1038/s41587-019-0315-8.
- 53. Gilliot P.-A., Gorochowski T.E. Transfer learning for cross-context prediction of protein expression from 5’UTR sequence. Nucleic Acids Res. 2024 doi: 10.1093/nar/gkae491.
- 54. Merzbacher C., Mac Aodha O., Oyarzún D.A. Bayesian optimization for design of multiscale biological circuits. ACS Synth Biol. 2023 Jul;12(7):2073–2082. doi: 10.1021/acssynbio.3c00120.
- 55. Barazandeh S., Ozden F., Hincer A., et al. Learning to generate 5’ UTR sequences for optimized ribosome load and gene expression. 2023 Feb.
- 56. Rafi A.M., Nogina D., Penzar D., et al. A community effort to optimize sequence-based deep learning models of gene regulation. Nat Biotechnol. 2024 Oct:1–11. doi: 10.1038/s41587-024-02414-w.
- 57. Sandhu M., Mater A.C., Matthews D.S., et al. Investigating the determinants of performance in machine learning for protein fitness prediction. Protein Sci. 2025;34(8) doi: 10.1002/pro.70235.
- 58. Shen Y., Kudla G., Oyarzún D.A. Code and data for “Optimization of regulatory DNA with active learning”. Zenodo. 2025.