Machine Learning Force Fields

Oliver T. Unke et al.

Chem Rev. 2021 Aug 25;121(16):10142-10186. doi: 10.1021/acs.chemrev.0c01111. Epub 2021 Mar 11.

Abstract

In recent years, the use of machine learning (ML) in computational chemistry has enabled numerous advances previously out of reach due to the computational complexity of traditional electronic-structure methods. One of the most promising applications is the construction of ML-based force fields (FFs), with the aim to narrow the gap between the accuracy of ab initio methods and the efficiency of classical FFs. The key idea is to learn the statistical relation between chemical structure and potential energy without relying on a preconceived notion of fixed chemical bonds or knowledge about the relevant interactions. Such universal ML approximations are in principle only limited by the quality and quantity of the reference data used to train them. This review gives an overview of applications of ML-FFs and the chemical insights that can be obtained from them. The core concepts underlying ML-FFs are described in detail, and a step-by-step guide for constructing and testing them from scratch is given. The text concludes with a discussion of the challenges that remain to be overcome by the next generation of ML-FFs.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Accurate ab initio methods are computationally demanding and can only be used to study small systems in the gas phase or regular periodic materials. Larger molecules in solution, such as proteins, are typically modeled with force fields, empirical functions that trade accuracy for computational efficiency. Machine learning methods are closing this gap, making it possible to study increasingly large chemical systems at ab initio accuracy with force-field efficiency.
Figure 2
ML-FFs combine the accuracy of ab initio methods with the efficiency of classical FFs. They provide easy access to a system’s potential energy surface (PES), which can in turn be used to derive a plethora of other quantities. By using them to run MD simulations on a single PES, ML-FFs provide chemical insights that are inaccessible to other methods (see gray box). For example, they accurately model electronic effects and their influence on thermodynamic observables and allow a natural description of chemical reactions, which is difficult or even impossible with conventional FFs. Their efficiency also allows them to be applied in situations where the Born–Oppenheimer approximation begins to break down and a single PES no longer provides an adequate description. Examples are the study of nuclear quantum effects and electronically excited states (upper right). Finally, ML-FFs can be further enhanced by modeling additional properties. This provides direct access to a wide range of molecular spectra, building a bridge between theory and experiment (lower right). In general, such studies would be prohibitively expensive with ab initio methods.
Figure 3
Top: Two-dimensional projections of the PESs of different molecules, highlighting rich topological differences and various possible shapes. Bottom: Cut through the PES of keto-malondialdehyde for rotations of the two aldehyde groups. Note that the shape repeats periodically for full rotations. Regions with low potential energy are drawn in blue and high-energy regions in yellow. Structure (a) leads to a steep increase in energy due to the proximity of the two oxygen atoms carrying negative partial charges. Local minima of the PES are shown in (b) and (c), whereas (d) displays structural fluctuations around the global minimum. Running molecular dynamics simulations reveals the most common transition paths (F1, F2, and F3) between the different minima.
Figure 4
(A) Blue and red points with coordinates (x1, x2) are linearly inseparable. (B) By defining a suitable mapping from the input space (x1, x2) to a higher-dimensional feature space (x1, x2, x3), blue and red points become linearly separable (gray plane at x3 = 0.5).
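To make the idea behind Figure 4 concrete, the following minimal Python sketch lifts a set of 2D points into a third dimension and checks that a plane separates the two classes. The specific mapping x3 = x1² + x2² and the example coordinates are illustrative assumptions; the figure does not prescribe a particular feature map.

    # Minimal sketch of the feature-space lifting in Figure 4. The mapping
    # x3 = x1**2 + x2**2 is an illustrative assumption; any nonlinear map that
    # makes the classes linearly separable would serve the same purpose.
    import numpy as np

    # "Red" points cluster near the origin, "blue" points surround them:
    # not linearly separable in the (x1, x2) plane.
    red = np.array([[0.1, 0.2], [-0.3, 0.1], [0.0, -0.2], [0.2, -0.1]])
    blue = np.array([[1.5, 0.0], [0.0, 1.6], [-1.4, 0.3], [0.2, -1.5]])

    def lift(points):
        """Map (x1, x2) -> (x1, x2, x3) with x3 = x1^2 + x2^2."""
        x3 = np.sum(points**2, axis=1, keepdims=True)
        return np.hstack([points, x3])

    red3, blue3 = lift(red), lift(blue)

    # In the lifted space, a plane of constant x3 separates the two classes.
    threshold = 0.5 * (red3[:, 2].max() + blue3[:, 2].min())
    separable = red3[:, 2].max() < threshold < blue3[:, 2].min()
    print(f"plane x3 = {threshold:.2f} separates the classes: {separable}")
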
Figure 5
Mean absolute force prediction errors (MAEs) of different ML models trained on molecules in the MD17 data set, colored by model type. Overall, kernel methods (GDML, sGDML, FCHL18/19) are slightly more data efficient, that is, they produce more accurate predictions from smaller training data sets, but neural network architectures (PhysNet, SchNet, DimeNet, EANN, DeePMD, DeepPot-SE, ACSF, HIP-NN) catch up quickly with increasing training set size and continue to improve as more training data becomes available.
Figure 6
Overview of the mathematical concepts that form the basis of kernel methods. (A) Gaussian process regression of a one-dimensional function f(x) (red line) from M = 5 data samples (xi, yi). The black line f̂(x) depicts the mean (eq 8) of the conditional probability p(f̂|y) (see eq 7), whereas the gray area depicts two standard deviations from its mean (see eq 9). Note that predictions are most confident in regions where training data is present. (B) The function f̂(x) can be expressed as a linear combination of M kernel functions K(x, xi) weighted by regression coefficients αi (see eq 2). In this example, the Gaussian kernel (eq 4) is used (the hyperparameter γ controls its width). (C) Influence of noise on prediction performance. Here, the function f(x) (thick gray line) is learned from M = 25 samples; however, each data point (xi, yi) contains observational noise (see eq 6). When the coefficients αi are determined without regularization, that is, when no noise is assumed to be present, the model function reproduces the training samples faithfully but undulates wildly between data points (orange line, λ = 0). The regularized solution (blue line, λ = 0.1, see eq 10) is much smoother and stays closer to the true function f(x), but individual data points are not reproduced exactly. When the regularization is too strong (green line, λ = 1.0), the model function becomes unable to fit the data. Note how regularization shrinks the magnitude of the coefficient vectors ∥α∥. (D) For constructing force fields, it is necessary to encode molecular structure in a representation x. The choice of this structural descriptor may strongly influence model performance. Here, the potential energy E of a diatomic molecule (thick gray line) is learned from M = 5 data points by two kernel machines using different structural representations (both models use a Gaussian kernel). When the interatomic distance r is used as descriptor (orange line, x = r), the predicted potential energy oscillates between data points, leading to spurious minima and qualitatively wrong behavior for large r. A model using the descriptor x = e^r (blue line) predicts a physically meaningful potential energy curve that is qualitatively correct even when the model extrapolates.
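The regularization behavior described in panel C can be reproduced with a few lines of kernel ridge regression. The sketch below is an illustrative stand-in, not the code behind the figure: the target function sin(2x), the noise level, and the hyperparameters γ and λ are assumptions, but the role of λ (eq 10) and the shrinking coefficient norm ∥α∥ follow the caption.

    # Kernel ridge regression sketch in the spirit of Figure 6B/C. The target
    # function, noise level, and hyperparameters are illustrative assumptions;
    # the point is how the regularization strength lambda trades off fitting
    # the noisy samples against smoothness, and how it shrinks the alphas.
    import numpy as np

    def gaussian_kernel(a, b, gamma):
        """K(a, b) = exp(-gamma * (a - b)^2) for all pairs of 1D inputs."""
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

    def f_true(x):
        return np.sin(2.0 * x)  # assumed ground-truth function

    rng = np.random.default_rng(0)
    x_train = np.linspace(-2.0, 2.0, 25)
    y_train = f_true(x_train) + rng.normal(scale=0.2, size=x_train.shape)  # noisy samples

    def fit(x, y, gamma, lam):
        """Solve (K + lambda*I) alpha = y for the regression coefficients."""
        K = gaussian_kernel(x, x, gamma)
        return np.linalg.solve(K + lam * np.eye(len(x)), y)

    def predict(x_new, x, alpha, gamma):
        """f_hat(x_new) = sum_i alpha_i * K(x_new, x_i)."""
        return gaussian_kernel(x_new, x, gamma) @ alpha

    x_test = np.linspace(-2.0, 2.0, 200)
    for lam in (1e-10, 0.1, 1.0):  # no, moderate, and strong regularization
        alpha = fit(x_train, y_train, gamma=10.0, lam=lam)
        err = np.abs(predict(x_test, x_train, alpha, gamma=10.0) - f_true(x_test)).mean()
        print(f"lambda = {lam:g}: ||alpha|| = {np.linalg.norm(alpha):8.2f}, mean abs. error = {err:.3f}")
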
Figure 7
Kernel ridge regression can be understood as a linear integral operator Tk that is applied to the (only partially known) target function of interest f(x). Such operators are defined as convolutions with a continuous kernel function K, whose response is the regression result. Because the training data is typically not sampled on a grid, this convolution task transforms into a linear system that yields the regression coefficients α. Because only Tkf(x) and not the true f(x) is recovered, the challenge is to find a kernel that defines an operator that leaves the relevant parts of the original function invariant. This is why the Gaussian kernel (eq 4) is a popular choice: depending on the chosen length scale γ, it attenuates high-frequency components while passing the low-frequency components of the input through, thereby making only minimal assumptions about the target function. However, stronger assumptions (e.g., combining kernels with physically motivated descriptors) can increase the sample efficiency of the regressor.
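The low-pass interpretation in this caption can be illustrated directly: convolving a signal with a normalized Gaussian kernel strongly attenuates its high-frequency content while leaving the low-frequency part almost untouched. The test signal, frequencies, and length scale below are illustrative assumptions.

    # Minimal sketch of the low-pass behavior described in Figure 7:
    # convolution with a normalized Gaussian kernel attenuates high-frequency
    # components while passing low-frequency ones. Signal composition and
    # length scale are illustrative assumptions.
    import numpy as np

    x = np.linspace(0.0, 1.0, 500)
    signal = np.sin(2 * np.pi * 2 * x) + 0.3 * np.sin(2 * np.pi * 40 * x)  # low + high frequency

    def gaussian_smooth(y, grid, length_scale):
        """Apply a row-normalized Gaussian kernel matrix to the signal."""
        weights = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2 * length_scale**2))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ y

    def amplitude(y, freq):
        """Approximate Fourier amplitude of the component at the given frequency."""
        return 2.0 * np.abs(np.mean(y * np.exp(-2j * np.pi * freq * x)))

    smoothed = gaussian_smooth(signal, x, length_scale=0.02)
    for label, y in (("original", signal), ("smoothed", smoothed)):
        print(f"{label}: amplitude at f=2 is {amplitude(y, 2):.2f}, at f=40 is {amplitude(y, 40):.2f}")
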
Figure 8
Schematic representation of the mathematical concepts underlying artificial (feed-forward) neural networks. (A) A single artificial neuron can have an arbitrary number of inputs and outputs. Here, a neuron that is connected to two inputs i1 and i2 with “synaptic weights” w1 and w2 is depicted. The bias term b can be thought of as the weight of an additional input with a value of 1. Artificial neurons compute the weighted sum of their inputs and pass this value through an activation function σ to other neurons in the neural network (here, the neuron has three outputs with connection weights w1, w2, and w3). (B) Possible activation functions σ(x). The bias term b effectively shifts the activation function along the x-axis. Many nonlinear functions are valid choices, but the most popular are sigmoid transformations such as tanh(x) or (smooth) ramp functions, for example, max(0, x) or ln(1 + e^x). (C) Artificial neural network with a single hidden layer of three neurons (gray) that maps two inputs x1 and x2 (blue) to two outputs y1 and y2 (yellow), see eq 15. For regression tasks, the output neurons typically use no activation function. Computing the weighted sums for the neurons of each layer can be efficiently implemented as a matrix-vector product (eq 14). Some entries of the weight matrices (W and W′) and bias vectors (b and b′) are highlighted in color with the corresponding connection in the diagram. (D) Schematic depiction of a deep neural network with L hidden layers (eq 16). Compared to using a single hidden layer with many neurons, it is usually more parameter-efficient to connect multiple hidden layers with fewer neurons sequentially.
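Panel C translates almost line by line into code. The sketch below builds the single-hidden-layer network of the figure with tanh activations and linear outputs; the weight values are random placeholders rather than trained parameters.

    # Minimal sketch of the feed-forward network in Figure 8C: two inputs are
    # mapped through one hidden layer of three tanh neurons to two linear
    # outputs. Weight values are random placeholders; the shapes mirror the
    # caption (weight matrices W, W' and bias vectors b, b').
    import numpy as np

    rng = np.random.default_rng(0)

    # Layer parameters: hidden layer (3 neurons) and output layer (2 neurons).
    W = rng.normal(size=(3, 2))      # hidden weights, one row per hidden neuron
    b = rng.normal(size=3)           # hidden biases
    W_out = rng.normal(size=(2, 3))  # output weights (W' in the caption)
    b_out = rng.normal(size=2)       # output biases (b')

    def forward(x):
        """Compute y = W' * sigma(W x + b) + b' with sigma = tanh."""
        hidden = np.tanh(W @ x + b)      # weighted sums + activation
        return W_out @ hidden + b_out    # linear output layer (no activation)

    x = np.array([0.5, -1.0])  # example input (x1, x2)
    print("outputs (y1, y2):", forward(x))
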
Figure 9
Overview of model selection process.
Figure 10
Differentiation of an energy estimator (blue) versus direct force reconstruction (red). The law of energy conservation is trivially obeyed in the first case but requires explicit a priori constraints in the latter scenario. The challenge in estimating forces directly lies in the complexity arising from their high 3N-dimensionality (three force components for each of the N atoms) in contrast to predicting a single scalar for the energy.
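The energy-derivative route can be sketched with a toy potential standing in for an ML energy model: forces are the negative gradient of the scalar energy with respect to the 3N atomic coordinates, so they are conservative by construction. The pair potential and geometry below are illustrative assumptions, and in practice the gradient would be obtained analytically or by automatic differentiation rather than by finite differences.

    # Minimal sketch of the energy-derivative route in Figure 10: forces are
    # obtained as F = -dE/dr from a scalar energy model, so they are
    # conservative by construction. A simple pairwise potential stands in for
    # an ML energy model here; the functional form is an illustrative choice.
    import numpy as np

    def energy(positions):
        """Toy pairwise energy E = sum_{i<j} (1/r_ij^12 - 1/r_ij^6)."""
        e = 0.0
        n = len(positions)
        for i in range(n):
            for j in range(i + 1, n):
                r = np.linalg.norm(positions[i] - positions[j])
                e += r**-12 - r**-6
        return e

    def forces_numerical(positions, h=1e-5):
        """F = -dE/dr via central finite differences (3N components for N atoms)."""
        forces = np.zeros_like(positions)
        for idx in np.ndindex(positions.shape):
            shift = np.zeros_like(positions)
            shift[idx] = h
            forces[idx] = -(energy(positions + shift) - energy(positions - shift)) / (2 * h)
        return forces

    atoms = np.array([[0.0, 0.0, 0.0],
                      [1.1, 0.0, 0.0],
                      [0.0, 1.2, 0.3]])
    F = forces_numerical(atoms)
    print("forces:\n", F)
    print("net force (should vanish):", F.sum(axis=0))
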
Figure 11
Regions of the PESs of ethanol, keto-malondialdehyde, and aspirin visited during a 200 ps ab initio MD simulation at 500 K at the PBE+TS/DFT level of theory (density functional theory with the Perdew–Burke–Ernzerhof functional and Tkatchenko–Scheffler dispersion correction). The black dashed lines indicate the symmetries of the PES. Note that regions related by symmetry are not necessarily visited equally often.
Figure 12
Overview of descriptor-based (top) and end-to-end (bottom) NNPs. Both types of architecture take as input a set of N nuclear charges Zi and Cartesian coordinates ri and output atomic energy contributions Ei, which are summed to give the total energy prediction E (here N = 9; an ethanol molecule is used as example). In the descriptor-based variant, pairwise distances rij and angles αijk between triplets of atoms are calculated from the Cartesian coordinates and used to compute hand-crafted two-body (G2) and three-body (G3) atom-centered symmetry functions (ACSFs) (see eqs 22 and 23). For each atom i, the values of M different G2 and K different G3 ACSFs are collected in a vector xi, which serves as a fingerprint of the atomic environment and is used as input to an NN predicting Ei. Information about the nuclear charges is encoded by having separate NNs and sets of ACSFs for all (combinations of) elements. In end-to-end NNPs, Zi is used to initialize the vector representation xi0 of each atom to an element-dependent (learnable) embedding (atoms with the same Zi start from the same representation). Geometric information is encoded by iteratively passing these descriptors (along with pairwise distances rij expanded in radial basis functions g(rij)) through T steps of NNs representing interaction functions and atom-wise refinements (see eq 25). The final descriptors xiT are used as input for an additional NN predicting the atomic energy contributions (typically, a single NN is shared among all elements).
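As an illustration of the descriptor-based branch, the sketch below evaluates a single two-body symmetry function of the standard Behler–Parrinello form, G2_i = Σ_{j≠i} exp(−η(r_ij − r_s)²) f_c(r_ij) with a cosine cutoff f_c. The parameter values and toy geometry are assumptions, and eq 22 in the review may use slightly different notation.

    # Minimal sketch of a two-body atom-centered symmetry function (G2) used
    # by descriptor-based NNPs (top branch of Figure 12). The cosine cutoff
    # and Gaussian form follow the standard Behler-Parrinello definition;
    # parameter values (eta, r_s, r_cut) and the geometry are illustrative.
    import numpy as np

    def cutoff(r, r_cut):
        """Cosine cutoff: smoothly goes to zero at r = r_cut."""
        return np.where(r < r_cut, 0.5 * (np.cos(np.pi * r / r_cut) + 1.0), 0.0)

    def g2(positions, center, eta=1.0, r_s=0.0, r_cut=6.0):
        """G2_i = sum_{j != i} exp(-eta * (r_ij - r_s)^2) * f_c(r_ij)."""
        value = 0.0
        for j, pos in enumerate(positions):
            if j == center:
                continue
            r_ij = np.linalg.norm(positions[center] - pos)
            value += np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_cut)
        return value

    # Toy geometry (three heavy atoms, coordinates in Angstrom).
    coords = np.array([[0.0, 0.0, 0.0],   # C
                       [1.5, 0.0, 0.0],   # C
                       [2.2, 1.2, 0.0]])  # O
    for i in range(len(coords)):
        print(f"atom {i}: G2 = {g2(coords, i):.4f}")
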
Figure 13
Overview of the most important steps when constructing and using ML-FFs.
Figure 14
(A) Two-dimensional projection of the sampled regions of the PES of ethanol at 100 K, 300 K, and 500 K from AIMD simulations with FHI-aims (Fritz Haber Institute ab initio molecular simulations) at the PBE+TS/DFT level of theory. Each simulation was 500 ps long. (B) Distribution of sampled potential energies for the three different temperatures.
Figure 15
Procedure followed to generate a database at the CCSD(T) level of theory for keto-malondialdehyde using sampling by proxy. An AIMD simulation at 500 K computed at the PBE+TS/DFT level of theory is used to sample the molecular PES. Afterward, the trajectory is subsampled (black dots) to generate a subset of representative geometries, for which single-point calculations at the CCSD(T) level of theory are performed (red dots). This highly accurate reference data is then used to train an ML-FF.
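The subsampling step of this workflow can be sketched in a few lines: frames from the cheap trajectory are mapped to a structural descriptor and a small, diverse subset is picked, for which the expensive CCSD(T) single points would then be run. Farthest-point sampling on inverse interatomic distances is just one possible selection scheme and is an assumption here, not the procedure used in the review.

    # Minimal sketch of the subsampling step in Figure 15's sampling-by-proxy
    # workflow: frames from a cheap trajectory are reduced to a small, diverse
    # subset that would then be recomputed at the CCSD(T) level.
    # Farthest-point sampling on an inverse-distance descriptor is an
    # illustrative assumption.
    import numpy as np

    def descriptor(frame):
        """Flatten a geometry into its vector of inverse interatomic distances."""
        n = len(frame)
        iu = np.triu_indices(n, k=1)
        dists = np.linalg.norm(frame[iu[0]] - frame[iu[1]], axis=1)
        return 1.0 / dists

    def farthest_point_subsample(frames, n_select):
        """Greedily pick frames that are maximally spread out in descriptor space."""
        X = np.array([descriptor(f) for f in frames])
        selected = [0]  # start from the first frame
        min_dist = np.linalg.norm(X - X[0], axis=1)
        while len(selected) < n_select:
            nxt = int(np.argmax(min_dist))
            selected.append(nxt)
            min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
        return selected

    # Fake "trajectory": 200 frames of a 3-atom molecule with random thermal noise.
    rng = np.random.default_rng(0)
    base = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.2, 0.0]])
    trajectory = [base + rng.normal(scale=0.1, size=base.shape) for _ in range(200)]

    picked = farthest_point_subsample(trajectory, n_select=10)
    print("frames selected for CCSD(T) single points:", picked)
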
Figure 16
One-dimensional cut through the PES of ethanol along the O–H bond distance for different ML-FFs (solid blue, yellow, and orange lines) compared to ab initio reference data (dashed black line). Close to the region sampled by the training data (range highlighted in gray), all model predictions are virtually identical to the reference method (see zoomed view). When extrapolating far from the sampled region, the different models have increasingly large prediction errors and behave unphysically.
Figure 17
(A) One-dimensional cut through a PES predicted by different ML models. The overfitted model (red line) reproduces the training data (black dots) faithfully, but oscillates wildly in between reference points, leading to “holes” (spurious minima) on the PES. During an MD simulation, trajectories may become trapped in these regions and produce unphysical structures (inset). The properly regularized model (green line) may not reproduce all training points exactly, but fits the true PES (gray line) well, even in regions where no training data is present. However, too much regularization may lead to underfitting (blue line), that is, the model becomes unable to reproduce the training data at all. (B) Typical progress of the loss measured on the training set (blue) and on the validation set (orange) during the training of a neural network. While the training loss decreases throughout the training process, the validation loss saturates and eventually increases again, which indicates that the model starts to overfit. To prevent overfitting, the training can be stopped early once the minimum of the validation loss is reached (dotted vertical line).
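The early-stopping criterion from panel B reduces to a simple loop: keep the parameters with the best validation loss and stop once it has not improved for a fixed number of epochs ("patience"). The loss curves below are simulated stand-ins, since no actual model is being trained.

    # Minimal early-stopping sketch matching Figure 17B: training continues
    # while the validation loss improves and stops after it has not improved
    # for `patience` consecutive epochs. The loss curves are simulated
    # stand-ins for a real training run.
    import numpy as np

    rng = np.random.default_rng(0)

    def simulated_losses(epoch):
        """Training loss keeps decreasing; validation loss turns up after epoch 60."""
        train = np.exp(-epoch / 30.0) + 0.01 * rng.random()
        valid = np.exp(-epoch / 30.0) + 0.002 * max(0, epoch - 60) + 0.01 * rng.random()
        return train, valid

    best_valid, best_epoch, patience, wait = np.inf, 0, 10, 0
    for epoch in range(1, 201):
        train_loss, valid_loss = simulated_losses(epoch)
        if valid_loss < best_valid:
            best_valid, best_epoch, wait = valid_loss, epoch, 0
            # in a real run, model parameters would be checkpointed here
        else:
            wait += 1
            if wait >= patience:  # validation loss has stopped improving
                break

    print(f"stopped at epoch {epoch}, best validation loss {best_valid:.4f} at epoch {best_epoch}")
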
Figure 18
Visualization of electronic effects, which are accurately modeled by ML-FFs but neglected by conventional FFs. Electron lone pairs, hybridization changes, and orbital donation effects all influence the dynamics of molecules and hence the properties that are computed from MD simulations. Neglecting them when predicting, for example, Gibbs/Helmholtz free energy surfaces (FESs) or molecular spectra will lead to qualitatively different results.
Figure 19
(A) Schematic description of the path integral (ring polymer) molecular dynamics (PIMD) method, where quantum particles are approximated by a classical ring polymer with P beads. There is an exact isomorphism between these two systems in the limit P → ∞, that is, their statistical properties become equivalent. (B) PIMD simulations using DFT or coupled cluster calculations. (1) Coupled cluster PIMD simulations of the Zundel model to compute the 1H magnetic shielding tensor (adapted with permission from ref (287). Copyright 2015, published by the PCCP Owner Societies under CC BY-NC 3.0 https://creativecommons.org/licenses/by-nc/3.0/). (2) Example of hydrogen-bond networks and the implications of their NQEs for biological function and enzyme catalysis (adapted with permission from ref (288). Copyright 2017 American Chemical Society). (3) IR spectrum of the porphycene molecule computed from PIMD simulations (adapted with permission from ref (289). Copyright 2019 American Chemical Society). (C) PIMD simulations using ML-FFs trained on DFT or coupled cluster data. (1) Ultralow-temperature dynamics of the Zundel model obtained from PIMD simulations (adapted with permission from ref (290). Copyright 2018 American Chemical Society). (2) Comparison of the statistical sampling of different conformers of ethanol between experiment and simulations (adapted with permission from ref (69). Copyright 2018 Chmiela et al.). (3) Schematic description of the enhancement of intra- and intermolecular interactions due to NQEs (adapted with permission from ref (161). Copyright 2020 Sauceda et al.).
Figure 20
Energy profiles of different ML-based PESs for a rotation of the dihedral angle between the terminal methylene groups of cumulenes (C2+nH4) of different sizes (0 ≤ n ≤ 7). All reference calculations were performed with the semiempirical MNDO method, and models were trained on 4500 structures (with an additional 450 structures used for validation) collected from MD simulations at 1000 K. Because rotations of the dihedral angle are not sufficiently sampled at this temperature, the dihedral was rotated randomly before performing the reference calculations. Instead of a sharp cusp at the maximum of the rotation barrier, all models predict a smooth curve. Predictions become worse with increasing cumulene size, with the cusp region being oversmoothed more strongly. For n = 7, all models fail to predict the angular energy dependence. Note that NNP models (such as PhysNet and SchNet) may already fail for smaller cumulenes when the cutoff distance is chosen too small (rcut = 6 Å), as they are unable to encode information about the dihedral angle in the environment descriptor. However, it is possible to increase the cutoff (rcut = 12 Å) to counter this effect.
Figure 21
Lennard-Jones potential (thick gray line) predicted by KRR with a Gaussian kernel. (a) The potential energy is decomposed into short- (red), middle- (magenta), and long-range (blue) parts, which are learned by separate models (symbols show the training data and solid lines the model predictions). The mean squared prediction errors (in arbitrary units) for the respective regions are shown in the corresponding colors. (b) The entire potential is learned by a single model using the same training points (green). All models in panels a and b use r as the structural descriptor. (c) A single model learns the potential but uses r^-1 as the structural descriptor (yellow). The mean squared errors (a.u.) for different parts of the potential in panels b and c are reported independently to allow direct comparison with the values reported in panel a.
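The comparison between panels b and c can be reproduced with the same kernel ridge regression code used twice, once with r and once with r^-1 as descriptor. The Lennard-Jones parameters, training grid, and kernel width below are illustrative assumptions; only the relative behavior of the two descriptors is of interest.

    # Minimal sketch of Figure 21b/c: the same Gaussian-kernel KRR model is
    # fit to a Lennard-Jones curve using either r or 1/r as the structural
    # descriptor. LJ parameters, training points, and hyperparameters are
    # illustrative assumptions.
    import numpy as np

    def lennard_jones(r):
        return 4.0 * (r**-12 - r**-6)  # epsilon = sigma = 1

    def kernel(a, b, gamma):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

    def fit_and_score(descriptor, r_train, r_test, gamma, lam=1e-8):
        """Train KRR on the chosen descriptor and return the test MSE."""
        x_train, x_test = descriptor(r_train), descriptor(r_test)
        K = kernel(x_train, x_train, gamma)
        alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), lennard_jones(r_train))
        pred = kernel(x_test, x_train, gamma) @ alpha
        return np.mean((pred - lennard_jones(r_test)) ** 2)

    r_train = np.linspace(0.95, 2.5, 12)   # training geometries
    r_test = np.linspace(0.95, 4.0, 300)   # includes an extrapolation region

    mse_r = fit_and_score(lambda r: r, r_train, r_test, gamma=20.0)
    mse_inv = fit_and_score(lambda r: 1.0 / r, r_train, r_test, gamma=20.0)
    print(f"MSE with descriptor r   : {mse_r:.3e}")
    print(f"MSE with descriptor 1/r : {mse_inv:.3e}")
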

References

    1. Feynman R. P.; Leighton R. B.; Sands M. The Feynman Lectures on Physics; Addison-Wesley, 1963; Vol. 1.
    2. McCammon J. A.; Gelin B. R.; Karplus M. Dynamics of Folded Proteins. Nature 1977, 267, 585–590. doi: 10.1038/267585a0.
    3. Phillips D. Biomolecular Stereodynamics; Adenine Press: Guilderland, NY, 1981.
    4. Schulz R.; Lindner B.; Petridis L.; Smith J. C. Scaling of Multimillion-Atom Biological Molecular Dynamics Simulation on a Petascale Supercomputer. J. Chem. Theory Comput. 2009, 5, 2798–2808. doi: 10.1021/ct900292r.
    5. Shaw D. E.; Deneroff M. M.; Dror R. O.; Kuskin J. S.; Larson R. H.; Salmon J. K.; Young C.; Batson B.; Bowers K. J.; Chao J. C.; et al. Anton, a Special-Purpose Machine for Molecular Dynamics Simulation. Commun. ACM 2008, 51, 91–97. doi: 10.1145/1364782.1364802.
