Machine Learning Force Fields

Oliver T. Unke et al.

Chem Rev. 2021 Aug 25;121(16):10142-10186. doi: 10.1021/acs.chemrev.0c01111. Epub 2021 Mar 11.

Abstract

In recent years, the use of machine learning (ML) in computational chemistry has enabled numerous advances previously out of reach due to the computational complexity of traditional electronic-structure methods. One of the most promising applications is the construction of ML-based force fields (FFs), with the aim to narrow the gap between the accuracy of ab initio methods and the efficiency of classical FFs. The key idea is to learn the statistical relation between chemical structure and potential energy without relying on a preconceived notion of fixed chemical bonds or knowledge about the relevant interactions. Such universal ML approximations are in principle only limited by the quality and quantity of the reference data used to train them. This review gives an overview of applications of ML-FFs and the chemical insights that can be obtained from them. The core concepts underlying ML-FFs are described in detail, and a step-by-step guide for constructing and testing them from scratch is given. The text concludes with a discussion of the challenges that remain to be overcome by the next generation of ML-FFs.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Accurate ab initio methods are computationally demanding and can only be used to study small systems in the gas phase or regular periodic materials. Larger molecules in solution, such as proteins, are typically modeled with force fields, empirical functions that trade accuracy for computational efficiency. Machine learning methods are closing this gap, making it possible to study increasingly large chemical systems at ab initio accuracy with force-field efficiency.
Figure 2
ML-FFs combine the accuracy of ab initio methods with the efficiency of classical FFs. They provide easy access to a system’s potential energy surface (PES), which can in turn be used to derive a plethora of other quantities. By using them to run MD simulations on a single PES, ML-FFs provide chemical insights that are inaccessible to other methods (see gray box). For example, they accurately model electronic effects and their influence on thermodynamic observables and allow a natural description of chemical reactions, which is difficult or even impossible with conventional FFs. Their efficiency also allows them to be applied in situations where the Born–Oppenheimer approximation begins to break down and a single PES no longer provides an adequate description. Examples are the study of nuclear quantum effects and electronically excited states (upper right). Finally, ML-FFs can be further enhanced by modeling additional properties. This provides direct access to a wide range of molecular spectra, building a bridge between theory and experiment (lower right). In general, such studies would be prohibitively expensive with ab initio methods.
Figure 3
Top: Two-dimensional projections of the PESs of different molecules, highlighting rich topological differences and various possible shapes. Bottom: Cut through the PES of keto-malondialdehyde for rotations of the two aldehyde groups. Note that the shape repeats periodically for full rotations. Regions with low potential energy are drawn in blue and high-energy regions in yellow. Structure (a) leads to a steep increase in energy due to the proximity of the two oxygen atoms carrying negative partial charges. Local minima of the PES are shown in (b) and (c), whereas (d) displays structural fluctuations around the global minimum. Running molecular dynamics simulations reveals the most common transition paths (F1, F2, and F3) between the different minima.
Figure 4
(A) Blue and red points with coordinates (x1, x2) are linearly inseparable. (B) By defining a suitable mapping from the input space (x1, x2) to a higher-dimensional feature space (x1, x2, x3), blue and red points become linearly separable (gray plane at x3 = 0.5).
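To make the idea behind Figure 4 concrete, the following minimal Python sketch lifts a set of 2D points into a third dimension and checks that a plane separates the two classes. The specific mapping x3 = x1² + x2² and the example coordinates are illustrative assumptions; the figure does not prescribe a particular feature map.

    # Minimal sketch of the feature-space lifting in Figure 4. The mapping
    # x3 = x1**2 + x2**2 is an illustrative assumption; any nonlinear map that
    # makes the classes linearly separable would serve the same purpose.
    import numpy as np

    # "Red" points cluster near the origin, "blue" points surround them:
    # not linearly separable in the (x1, x2) plane.
    red = np.array([[0.1, 0.2], [-0.3, 0.1], [0.0, -0.2], [0.2, -0.1]])
    blue = np.array([[1.5, 0.0], [0.0, 1.6], [-1.4, 0.3], [0.2, -1.5]])

    def lift(points):
        """Map (x1, x2) -> (x1, x2, x3) with x3 = x1^2 + x2^2."""
        x3 = np.sum(points**2, axis=1, keepdims=True)
        return np.hstack([points, x3])

    red3, blue3 = lift(red), lift(blue)

    # In the lifted space, a plane of constant x3 separates the two classes.
    threshold = 0.5 * (red3[:, 2].max() + blue3[:, 2].min())
    separable = red3[:, 2].max() < threshold < blue3[:, 2].min()
    print(f"plane x3 = {threshold:.2f} separates the classes: {separable}")
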
Figure 5
Mean absolute force prediction errors (MAEs) of different ML models trained on molecules in the MD17 data set, colored by model type. Overall, kernel methods (GDML, sGDML, FCHL18/19) are slightly more data efficient, that is, they produce more accurate predictions from smaller training data sets, but neural network architectures (PhysNet, SchNet, DimeNet, EANN, DeePMD, DeepPot-SE, ACSF, HIP-NN) catch up quickly with increasing training set size and continue to improve as more training data becomes available.
Figure 6
Overview of the mathematical concepts that form the basis of kernel methods. (A) Gaussian process regression of a one-dimensional function f(x) (red line) from M = 5 data samples (xi, yi). The black line f̂(x) depicts the mean (eq 8) of the conditional probability p(f̂|y) (see eq 7), whereas the gray area depicts two standard deviations from its mean (see eq 9). Note that predictions are most confident in regions where training data is present. (B) The function f̂(x) can be expressed as a linear combination of M kernel functions K(x, xi) weighted by regression coefficients αi (see eq 2). In this example, the Gaussian kernel (eq 4) is used (the hyperparameter γ controls its width). (C) Influence of noise on prediction performance. Here, the function f(x) (thick gray line) is learned from M = 25 samples; however, each data point (xi, yi) contains observational noise (see eq 6). When the coefficients αi are determined without regularization, that is, when no noise is assumed to be present, the model function reproduces the training samples faithfully but undulates wildly between data points (orange line, λ = 0). The regularized solution (blue line, λ = 0.1, see eq 10) is much smoother and stays closer to the true function f(x), but individual data points are not reproduced exactly. When the regularization is too strong (green line, λ = 1.0), the model function becomes unable to fit the data. Note how regularization shrinks the magnitude of the coefficient vectors ∥α∥. (D) For constructing force fields, it is necessary to encode molecular structure in a representation x. The choice of this structural descriptor may strongly influence model performance. Here, the potential energy E of a diatomic molecule (thick gray line) is learned from M = 5 data points by two kernel machines using different structural representations (both models use a Gaussian kernel). When the interatomic distance r is used as descriptor (orange line, x = r), the predicted potential energy oscillates between data points, leading to spurious minima and qualitatively wrong behavior for large r. A model using the descriptor x = e^r (blue line) predicts a physically meaningful potential energy curve that is qualitatively correct even when the model extrapolates.
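The regularization behavior described in panel C can be reproduced with a few lines of kernel ridge regression. The sketch below is an illustrative stand-in, not the code behind the figure: the target function sin(2x), the noise level, and the hyperparameters γ and λ are assumptions, but the role of λ (eq 10) and the shrinking coefficient norm ∥α∥ follow the caption.

    # Kernel ridge regression sketch in the spirit of Figure 6B/C. The target
    # function, noise level, and hyperparameters are illustrative assumptions;
    # the point is how the regularization strength lambda trades off fitting
    # the noisy samples against smoothness, and how it shrinks the alphas.
    import numpy as np

    def gaussian_kernel(a, b, gamma):
        """K(a, b) = exp(-gamma * (a - b)^2) for all pairs of 1D inputs."""
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

    def f_true(x):
        return np.sin(2.0 * x)  # assumed ground-truth function

    rng = np.random.default_rng(0)
    x_train = np.linspace(-2.0, 2.0, 25)
    y_train = f_true(x_train) + rng.normal(scale=0.2, size=x_train.shape)  # noisy samples

    def fit(x, y, gamma, lam):
        """Solve (K + lambda*I) alpha = y for the regression coefficients."""
        K = gaussian_kernel(x, x, gamma)
        return np.linalg.solve(K + lam * np.eye(len(x)), y)

    def predict(x_new, x, alpha, gamma):
        """f_hat(x_new) = sum_i alpha_i * K(x_new, x_i)."""
        return gaussian_kernel(x_new, x, gamma) @ alpha

    x_test = np.linspace(-2.0, 2.0, 200)
    for lam in (1e-10, 0.1, 1.0):  # no, moderate, and strong regularization
        alpha = fit(x_train, y_train, gamma=10.0, lam=lam)
        err = np.abs(predict(x_test, x_train, alpha, gamma=10.0) - f_true(x_test)).mean()
        print(f"lambda = {lam:g}: ||alpha|| = {np.linalg.norm(alpha):8.2f}, mean abs. error = {err:.3f}")
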
Figure 7
Kernel ridge regression can be understood as a linear integral operator Tk that is applied to the (only partially known) target function of interest f(x). Such operators are defined as convolutions with a continuous kernel function K, whose response is the regression result. Because the training data is typically not sampled on a grid, this convolution task transforms into a linear system that yields the regression coefficients α. Because only Tkf(x) and not the true f(x) is recovered, the challenge is to find a kernel that defines an operator that leaves the relevant parts of the original function invariant. This is why the Gaussian kernel (eq 4) is a popular choice: depending on the chosen length scale γ, it attenuates high-frequency components while passing the low-frequency components of the input through, thereby making only minimal assumptions about the target function. However, stronger assumptions (e.g., combining kernels with physically motivated descriptors) can increase the sample efficiency of the regressor.
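The low-pass interpretation in this caption can be illustrated directly: convolving a signal with a normalized Gaussian kernel strongly attenuates its high-frequency content while leaving the low-frequency part almost untouched. The test signal, frequencies, and length scale below are illustrative assumptions.

    # Minimal sketch of the low-pass behavior described in Figure 7:
    # convolution with a normalized Gaussian kernel attenuates high-frequency
    # components while passing low-frequency ones. Signal composition and
    # length scale are illustrative assumptions.
    import numpy as np

    x = np.linspace(0.0, 1.0, 500)
    signal = np.sin(2 * np.pi * 2 * x) + 0.3 * np.sin(2 * np.pi * 40 * x)  # low + high frequency

    def gaussian_smooth(y, grid, length_scale):
        """Apply a row-normalized Gaussian kernel matrix to the signal."""
        weights = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2 * length_scale**2))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ y

    def amplitude(y, freq):
        """Approximate Fourier amplitude of the component at the given frequency."""
        return 2.0 * np.abs(np.mean(y * np.exp(-2j * np.pi * freq * x)))

    smoothed = gaussian_smooth(signal, x, length_scale=0.02)
    for label, y in (("original", signal), ("smoothed", smoothed)):
        print(f"{label}: amplitude at f=2 is {amplitude(y, 2):.2f}, at f=40 is {amplitude(y, 40):.2f}")
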
Figure 8
Schematic representation of the mathematical concepts underlying artificial (feed-forward) neural networks. (A) A single artificial neuron can have an arbitrary number of inputs and outputs. Here, a neuron that is connected to two inputs i1 and i2 with “synaptic weights” w1 and w2 is depicted. The bias term b can be thought of as the weight of an additional input with a value of 1. Artificial neurons compute the weighted sum of their inputs and pass this value through an activation function σ to other neurons in the neural network (here, the neuron has three outputs with connection weights w1, w2, and w3). (B) Possible activation functions σ(x). The bias term b effectively shifts the activation function along the x-axis. Many nonlinear functions are valid choices, but the most popular are sigmoid transformations such as tanh(x) or (smooth) ramp functions, for example, max(0, x) or ln(1 + e^x). (C) Artificial neural network with a single hidden layer of three neurons (gray) that maps two inputs x1 and x2 (blue) to two outputs y1 and y2 (yellow), see eq 15. For regression tasks, the output neurons typically use no activation function. Computing the weighted sums for the neurons of each layer can be efficiently implemented as a matrix-vector product (eq 14). Some entries of the weight matrices (W and W′) and bias vectors (b and b′) are highlighted in color with the corresponding connection in the diagram. (D) Schematic depiction of a deep neural network with L hidden layers (eq 16). Compared to using a single hidden layer with many neurons, it is usually more parameter-efficient to connect multiple hidden layers with fewer neurons sequentially.
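Panel C translates almost line by line into code. The sketch below builds the single-hidden-layer network of the figure with tanh activations and linear outputs; the weight values are random placeholders rather than trained parameters.

    # Minimal sketch of the feed-forward network in Figure 8C: two inputs are
    # mapped through one hidden layer of three tanh neurons to two linear
    # outputs. Weight values are random placeholders; the shapes mirror the
    # caption (weight matrices W, W' and bias vectors b, b').
    import numpy as np

    rng = np.random.default_rng(0)

    # Layer parameters: hidden layer (3 neurons) and output layer (2 neurons).
    W = rng.normal(size=(3, 2))      # hidden weights, one row per hidden neuron
    b = rng.normal(size=3)           # hidden biases
    W_out = rng.normal(size=(2, 3))  # output weights (W' in the caption)
    b_out = rng.normal(size=2)       # output biases (b')

    def forward(x):
        """Compute y = W' * sigma(W x + b) + b' with sigma = tanh."""
        hidden = np.tanh(W @ x + b)      # weighted sums + activation
        return W_out @ hidden + b_out    # linear output layer (no activation)

    x = np.array([0.5, -1.0])  # example input (x1, x2)
    print("outputs (y1, y2):", forward(x))
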
Figure 9
Overview of model selection process.
Figure 10
Differentiation of an energy estimator (blue) versus direct force reconstruction (red). The law of energy conservation is trivially obeyed in the first case but requires explicit a priori constraints in the latter scenario. The challenge in estimating forces directly lies in the complexity arising from their high 3N-dimensionality (three force components for each of the N atoms) in contrast to predicting a single scalar for the energy.
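The energy-derivative route can be sketched with a toy potential standing in for an ML energy model: forces are the negative gradient of the scalar energy with respect to the 3N atomic coordinates, so they are conservative by construction. The pair potential and geometry below are illustrative assumptions, and in practice the gradient would be obtained analytically or by automatic differentiation rather than by finite differences.

    # Minimal sketch of the energy-derivative route in Figure 10: forces are
    # obtained as F = -dE/dr from a scalar energy model, so they are
    # conservative by construction. A simple pairwise potential stands in for
    # an ML energy model here; the functional form is an illustrative choice.
    import numpy as np

    def energy(positions):
        """Toy pairwise energy E = sum_{i<j} (1/r_ij^12 - 1/r_ij^6)."""
        e = 0.0
        n = len(positions)
        for i in range(n):
            for j in range(i + 1, n):
                r = np.linalg.norm(positions[i] - positions[j])
                e += r**-12 - r**-6
        return e

    def forces_numerical(positions, h=1e-5):
        """F = -dE/dr via central finite differences (3N components for N atoms)."""
        forces = np.zeros_like(positions)
        for idx in np.ndindex(positions.shape):
            shift = np.zeros_like(positions)
            shift[idx] = h
            forces[idx] = -(energy(positions + shift) - energy(positions - shift)) / (2 * h)
        return forces

    atoms = np.array([[0.0, 0.0, 0.0],
                      [1.1, 0.0, 0.0],
                      [0.0, 1.2, 0.3]])
    F = forces_numerical(atoms)
    print("forces:\n", F)
    print("net force (should vanish):", F.sum(axis=0))
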
Figure 11
Regions of the PESs of ethanol, keto-malondialdehyde, and aspirin visited during a 200 ps ab initio MD simulation at 500 K at the PBE+TS/DFT level of theory (density functional theory with the Perdew–Burke–Ernzerhof functional and Tkatchenko–Scheffler dispersion correction). The black dashed lines indicate the symmetries of the PES. Note that regions related by symmetry are not necessarily visited equally often.
Figure 12
Overview of descriptor-based (top) and end-to-end (bottom) NNPs. Both types of architecture take as input a set of N nuclear charges Zi and Cartesian coordinates ri and output atomic energy contributions Ei, which are summed to give the total energy prediction E (here N = 9; an ethanol molecule is used as example). In the descriptor-based variant, pairwise distances rij and angles αijk between triplets of atoms are calculated from the Cartesian coordinates and used to compute hand-crafted two-body (G2) and three-body (G3) atom-centered symmetry functions (ACSFs) (see eqs 22 and 23). For each atom i, the values of M different G2 and K different G3 ACSFs are collected in a vector xi, which serves as a fingerprint of the atomic environment and is used as input to an NN predicting Ei. Information about the nuclear charges is encoded by having separate NNs and sets of ACSFs for all (combinations of) elements. In end-to-end NNPs, Zi is used to initialize the vector representation xi0 of each atom to an element-dependent (learnable) embedding (atoms with the same Zi start from the same representation). Geometric information is encoded by iteratively passing these descriptors (along with pairwise distances rij expanded in radial basis functions g(rij)) through T steps of NNs representing interaction functions and atom-wise refinements (see eq 25). The final descriptors xiT are used as input for an additional NN predicting the atomic energy contributions (typically, a single NN is shared among all elements).
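As an illustration of the descriptor-based branch, the sketch below evaluates a single two-body symmetry function of the standard Behler–Parrinello form, G2_i = Σ_{j≠i} exp(−η(r_ij − r_s)²) f_c(r_ij) with a cosine cutoff f_c. The parameter values and toy geometry are assumptions, and eq 22 in the review may use slightly different notation.

    # Minimal sketch of a two-body atom-centered symmetry function (G2) used
    # by descriptor-based NNPs (top branch of Figure 12). The cosine cutoff
    # and Gaussian form follow the standard Behler-Parrinello definition;
    # parameter values (eta, r_s, r_cut) and the geometry are illustrative.
    import numpy as np

    def cutoff(r, r_cut):
        """Cosine cutoff: smoothly goes to zero at r = r_cut."""
        return np.where(r < r_cut, 0.5 * (np.cos(np.pi * r / r_cut) + 1.0), 0.0)

    def g2(positions, center, eta=1.0, r_s=0.0, r_cut=6.0):
        """G2_i = sum_{j != i} exp(-eta * (r_ij - r_s)^2) * f_c(r_ij)."""
        value = 0.0
        for j, pos in enumerate(positions):
            if j == center:
                continue
            r_ij = np.linalg.norm(positions[center] - pos)
            value += np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_cut)
        return value

    # Toy geometry (three heavy atoms, coordinates in Angstrom).
    coords = np.array([[0.0, 0.0, 0.0],   # C
                       [1.5, 0.0, 0.0],   # C
                       [2.2, 1.2, 0.0]])  # O
    for i in range(len(coords)):
        print(f"atom {i}: G2 = {g2(coords, i):.4f}")
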
Figure 13
Overview of the most important steps when constructing and using ML-FFs.
Figure 14
(A) Two-dimensional projection of the sampled regions of the PES of ethanol at 100 K, 300 K, and 500 K from AIMD simulations with FHI-aims (Fritz Haber Institute ab initio molecular simulations) at the PBE+TS/DFT level of theory. Each simulation was 500 ps long. (B) Distribution of sampled potential energies for the three different temperatures.
Figure 15
Procedure followed to generate a database at the CCSD(T) level of theory for keto-malondialdehyde using sampling by proxy. An AIMD simulation at 500 K computed at the PBE+TS/DFT level of theory is used to sample the molecular PES. Afterward, the trajectory is subsampled (black dots) to generate a subset of representative geometries, for which single-point calculations at the CCSD(T) level of theory are performed (red dots). This highly accurate reference data is then used to train an ML-FF.
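The subsampling step of this workflow can be sketched in a few lines: frames from the cheap trajectory are mapped to a structural descriptor and a small, diverse subset is picked, for which the expensive CCSD(T) single points would then be run. Farthest-point sampling on inverse interatomic distances is just one possible selection scheme and is an assumption here, not the procedure used in the review.

    # Minimal sketch of the subsampling step in Figure 15's sampling-by-proxy
    # workflow: frames from a cheap trajectory are reduced to a small, diverse
    # subset that would then be recomputed at the CCSD(T) level.
    # Farthest-point sampling on an inverse-distance descriptor is an
    # illustrative assumption.
    import numpy as np

    def descriptor(frame):
        """Flatten a geometry into its vector of inverse interatomic distances."""
        n = len(frame)
        iu = np.triu_indices(n, k=1)
        dists = np.linalg.norm(frame[iu[0]] - frame[iu[1]], axis=1)
        return 1.0 / dists

    def farthest_point_subsample(frames, n_select):
        """Greedily pick frames that are maximally spread out in descriptor space."""
        X = np.array([descriptor(f) for f in frames])
        selected = [0]  # start from the first frame
        min_dist = np.linalg.norm(X - X[0], axis=1)
        while len(selected) < n_select:
            nxt = int(np.argmax(min_dist))
            selected.append(nxt)
            min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
        return selected

    # Fake "trajectory": 200 frames of a 3-atom molecule with random thermal noise.
    rng = np.random.default_rng(0)
    base = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.2, 0.0]])
    trajectory = [base + rng.normal(scale=0.1, size=base.shape) for _ in range(200)]

    picked = farthest_point_subsample(trajectory, n_select=10)
    print("frames selected for CCSD(T) single points:", picked)
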
Figure 16
One-dimensional cut through the PES of ethanol along the O–H bond distance for different ML-FFs (solid blue, yellow, and orange lines) compared to ab initio reference data (dashed black line). Close to the region sampled by the training data (range highlighted in gray), all model predictions are virtually identical to the reference method (see zoomed view). When extrapolating far from the sampled region, the different models have increasingly large prediction errors and behave unphysically.
Figure 17
(A) One-dimensional cut through a PES predicted by different ML models. The overfitted model (red line) reproduces the training data (black dots) faithfully, but oscillates wildly in between reference points, leading to “holes” (spurious minima) on the PES. During an MD simulation, trajectories may become trapped in these regions and produce unphysical structures (inset). The properly regularized model (green line) may not reproduce all training points exactly, but fits the true PES (gray line) well, even in regions where no training data is present. However, too much regularization may lead to underfitting (blue line), that is, the model becomes unable to reproduce the training data at all. (B) Typical progress of the loss measured on the training set (blue) and on the validation set (orange) during the training of a neural network. While the training loss decreases throughout the training process, the validation loss saturates and eventually increases again, which indicates that the model starts to overfit. To prevent overfitting, the training can be stopped early once the minimum of the validation loss is reached (dotted vertical line).
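The early-stopping criterion from panel B reduces to a simple loop: keep the parameters with the best validation loss and stop once it has not improved for a fixed number of epochs ("patience"). The loss curves below are simulated stand-ins, since no actual model is being trained.

    # Minimal early-stopping sketch matching Figure 17B: training continues
    # while the validation loss improves and stops after it has not improved
    # for `patience` consecutive epochs. The loss curves are simulated
    # stand-ins for a real training run.
    import numpy as np

    rng = np.random.default_rng(0)

    def simulated_losses(epoch):
        """Training loss keeps decreasing; validation loss turns up after epoch 60."""
        train = np.exp(-epoch / 30.0) + 0.01 * rng.random()
        valid = np.exp(-epoch / 30.0) + 0.002 * max(0, epoch - 60) + 0.01 * rng.random()
        return train, valid

    best_valid, best_epoch, patience, wait = np.inf, 0, 10, 0
    for epoch in range(1, 201):
        train_loss, valid_loss = simulated_losses(epoch)
        if valid_loss < best_valid:
            best_valid, best_epoch, wait = valid_loss, epoch, 0
            # in a real run, model parameters would be checkpointed here
        else:
            wait += 1
            if wait >= patience:  # validation loss has stopped improving
                break

    print(f"stopped at epoch {epoch}, best validation loss {best_valid:.4f} at epoch {best_epoch}")
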
Figure 18
Visualization of electronic effects, which are accurately modeled by ML-FFs but neglected by conventional FFs. Electron lone pairs, hybridization changes, and orbital donation effects all influence the dynamics of molecules and hence the properties that are computed from MD simulations. Neglecting them when predicting, for example, Gibbs/Helmholtz free energy surfaces (FESs) or molecular spectra will lead to qualitatively different results.
Figure 19
(A) Schematic description of the path integral (ring polymer) molecular dynamics (PIMD) method, where quantum particles are approximated by a classical ring polymer with P beads. There is an exact isomorphism between these two systems in the limit P → ∞, that is, their statistical properties become equivalent. (B) PIMD simulations using DFT or coupled cluster calculations. (1) Coupled cluster PIMD simulations of the Zundel model to compute the 1H magnetic shielding tensor (adapted with permission from ref (287). Copyright 2015, published by the PCCP Owner Societies under CC BY-NC 3.0 https://creativecommons.org/licenses/by-nc/3.0/). (2) Example of hydrogen-bond networks and the implications of their NQEs for biological function and enzyme catalysis (adapted with permission from ref (288). Copyright 2017 American Chemical Society). (3) IR spectrum of the porphycene molecule computed from PIMD simulations (adapted with permission from ref (289). Copyright 2019 American Chemical Society). (C) PIMD simulations using ML-FFs trained on DFT or coupled cluster data. (1) Ultralow-temperature dynamics of the Zundel model obtained from PIMD simulations (adapted with permission from ref (290). Copyright 2018 American Chemical Society). (2) Comparison of the statistical sampling of different conformers of ethanol between experiment and simulations (adapted with permission from ref (69). Copyright 2018 Chmiela et al.). (3) Schematic description of the enhancement of intra- and intermolecular interactions due to NQEs (adapted with permission from ref (161). Copyright 2020 Sauceda et al.).
Figure 20
Energy profiles of different ML-based PESs for a rotation of the dihedral angle between the terminal methylene groups of cumulenes (C2+nH4) of different sizes (0 ≤ n ≤ 7). All reference calculations were performed with the semiempirical MNDO method, and models were trained on 4500 structures (with an additional 450 structures used for validation) collected from MD simulations at 1000 K. Because rotations of the dihedral angle are not sufficiently sampled at this temperature, the dihedral was rotated randomly before performing the reference calculations. Instead of a sharp cusp at the maximum of the rotation barrier, all models predict a smooth curve. Predictions become worse with increasing cumulene size, with the cusp region being oversmoothed more strongly. For n = 7, all models fail to predict the angular energy dependence. Note that NNP models (such as PhysNet and SchNet) may already fail for smaller cumulenes when the cutoff distance is chosen too small (rcut = 6 Å), as they are unable to encode information about the dihedral angle in the environment descriptor. However, it is possible to increase the cutoff (rcut = 12 Å) to counter this effect.
Figure 21
Lennard-Jones potential (thick gray line) predicted by KRR with a Gaussian kernel. (a) The potential energy is decomposed into short- (red), middle- (magenta), and long-range (blue) parts, which are learned by separate models (symbols show the training data and solid lines the model predictions). The mean squared prediction errors (in arbitrary units) for the respective regions are shown in the corresponding colors. (b) The entire potential is learned by a single model using the same training points (green). All models in panels a and b use r as the structural descriptor. (c) A single model learns the potential but uses r^-1 as the structural descriptor (yellow). The mean squared errors (a.u.) for different parts of the potential in panels b and c are reported independently to allow direct comparison with the values reported in panel a.
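The comparison between panels b and c can be reproduced with the same kernel ridge regression code used twice, once with r and once with r^-1 as descriptor. The Lennard-Jones parameters, training grid, and kernel width below are illustrative assumptions; only the relative behavior of the two descriptors is of interest.

    # Minimal sketch of Figure 21b/c: the same Gaussian-kernel KRR model is
    # fit to a Lennard-Jones curve using either r or 1/r as the structural
    # descriptor. LJ parameters, training points, and hyperparameters are
    # illustrative assumptions.
    import numpy as np

    def lennard_jones(r):
        return 4.0 * (r**-12 - r**-6)  # epsilon = sigma = 1

    def kernel(a, b, gamma):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

    def fit_and_score(descriptor, r_train, r_test, gamma, lam=1e-8):
        """Train KRR on the chosen descriptor and return the test MSE."""
        x_train, x_test = descriptor(r_train), descriptor(r_test)
        K = kernel(x_train, x_train, gamma)
        alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), lennard_jones(r_train))
        pred = kernel(x_test, x_train, gamma) @ alpha
        return np.mean((pred - lennard_jones(r_test)) ** 2)

    r_train = np.linspace(0.95, 2.5, 12)   # training geometries
    r_test = np.linspace(0.95, 4.0, 300)   # includes an extrapolation region

    mse_r = fit_and_score(lambda r: r, r_train, r_test, gamma=20.0)
    mse_inv = fit_and_score(lambda r: 1.0 / r, r_train, r_test, gamma=20.0)
    print(f"MSE with descriptor r   : {mse_r:.3e}")
    print(f"MSE with descriptor 1/r : {mse_inv:.3e}")
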

References

    1. Feynman R. P.; Leighton R. B.; Sands M. The Feynman Lectures on Physics; Addison-Wesley, 1963; Vol. 1.
    2. McCammon J. A.; Gelin B. R.; Karplus M. Dynamics of Folded Proteins. Nature 1977, 267, 585–590. doi: 10.1038/267585a0.
    3. Phillips D. Biomolecular Stereodynamics; Adenine Press: Guilderland, NY, 1981.
    4. Schulz R.; Lindner B.; Petridis L.; Smith J. C. Scaling of Multimillion-Atom Biological Molecular Dynamics Simulation on a Petascale Supercomputer. J. Chem. Theory Comput. 2009, 5, 2798–2808. doi: 10.1021/ct900292r.
    5. Shaw D. E.; Deneroff M. M.; Dror R. O.; Kuskin J. S.; Larson R. H.; Salmon J. K.; Young C.; Batson B.; Bowers K. J.; Chao J. C.; et al. Anton, a Special-Purpose Machine for Molecular Dynamics Simulation. Commun. ACM 2008, 51, 91–97. doi: 10.1145/1364782.1364802.
