Monte Carlo approximation of the logarithm of the determinant of large matrices with applications for linear mixed models in quantitative genetics

Bermann, Matias; Alvarez-Munera, Alejandra; Legarra, Andres; Aguilar, Ignacio; Misztal, Ignacy; Lourenco, Daniela

doi:10.1186/s12711-025-00991-1

Research Article
Open access
Published: 06 August 2025

Monte Carlo approximation of the logarithm of the determinant of large matrices with applications for linear mixed models in quantitative genetics

Genetics Selection Evolution volume 57, Article number: 44 (2025) Cite this article

756 Accesses
Metrics details

Abstract

Background

Likelihood-based inferences such as variance components estimation and hypothesis testing need logarithms of the determinant (log-determinant) of high dimensional matrices. Calculating the log-determinant is memory and time-consuming, making it impossible to perform likelihood-based inferences for large datasets.

Results

We presented a method for approximating the log-determinant of positive semi-definite matrices based on repeated matrix–vector products and complex calculus. We tested the approximation of the log-determinant in beef and dairy cattle, chicken, and pig datasets including single and multiple-trait models. Average absolute relative differences between the approximated and exact log-determinant were around 10^–3. The approximation was between 2 and 500 times faster than the exact calculation for medium and large matrices. We compared the restricted likelihood with (approximated) and without (exact) the approximation of the log-determinant for different values of heritability for a single-trait model. We also compared estimated variance components using exact expectation–maximization (EM) and average information (AI) REML algorithms, against two derivative-free approaches using the restricted likelihood calculated with the log-determinant approximation. The approximated and exact restricted likelihood showed maxima at the same heritability value. Derivative-free estimation of variance components with the approximated log-determinant converged to the same values as EM and AI-REML. The proposed approach is feasible to apply to any data size.

Conclusions

The method presented in this study allows to approximate the log-determinant of positive semi-definite matrices and, therefore, the likelihood for datasets of any size. This opens the possibility of performing likelihood-based inferences for large datasets in animal and plant breeding.

Background

Likelihood-based inferences in quantitative analysis of traits in animal and plant breeding need logarithms of the determinant (log-determinant) of high dimensional matrices. Popular likelihood-based inferences in animal breeding and genetics are: parametric model comparison using the Likelihood Ratio Test or Akaike Information Criterion (AIC); using the value of the log-likelihood as a convergence criterion in expectation–maximization (EM) or average information (AI) REML; or optimizing numerically the log-likelihood function in derivative-free REML. The most computationally expensive step in likelihood-based inferences is the calculation of the log-determinant of high dimensional matrices. Thus, alleviating such a cost in terms of memory and time requirements would allow using large datasets and reduce the use of computing resources for likelihood-based inferences.

The matrices involved in likelihood-based inference are positive semi-definite. Thus, the (pseudo) log-determinant is calculated using the Cholesky (or LDL) decomposition of the matrix with a computing cost proportional to $O\left({n}^{2}\right)$, where $n$ is the dimension of the matrix. Although with sparse-matrix techniques the Cholesky decomposition is feasible for large matrices, it becomes time and memory-prohibitable for very large models or when genomic information is used, which considerably reduces the sparsity of the involved matrices [1]. Therefore, likelihood-based inferences cannot be made for large datasets or moderate-size datasets with genomic information. With frequentist or Markov Chain Monte Carlo (MCMC) methods, likelihood-based inferences are suitable in terms of time and memory usage for models with less than a couple of millions of equations. In contrast, large models for genetic evaluations could have tens or hundreds of millions of equations.

Since exact methods are unfeasible to apply to modern datasets with existing computing resources, numerical linear algebra is moving towards approximation methods for large applications. By employing notions of complex calculus, Hale et al. [2] and Aune et al. [3] approximated the log-determinant of a matrix $\mathbf{M}$ based on repeated matrix–vector products $\mathbf{M}\mathbf{x}$, where $\mathbf{x}$ is a suitable vector. In animal and plant breeding, there are very efficient algorithms for specific matrix–vector multiplications for extremely large matrices. In fact, most often these methods do not set up matrices explicitly. These algorithms involve iteration on data techniques [4,5,6,7] and specific methods for manipulating genomic matrices [8,9,10]. Therefore, approximating the log-determinant of these matrices by means of matrix–vector products seems feasible.

This study aims to introduce and adapt the approximation of log-determinants based on matrix–vector products and complex calculus ideas, which has already been used in other fields like physics, engineering, and spatial statistics, to the animal and plant breeding and genetics field. The approximation of the log-determinant will be compared against its exact counterpart, in terms of accuracy and computing time. Applications of the log-determinant approximation for variance components estimation will be tested and discussed.

Methods

Theory

In this section, we briefly review the theory for approximating $\text{log}\left|\mathbf{M}\right|$, where we assume that $\mathbf{M}$ is a positive semi-definite matrix. For more details, we refer the reader to Hale et al. [2] and Aune et al. [3].

Because of the Jacobi’s Determinant Lemma [11; p.309], we have the following identity:

$$\log \left| {\mathbf{M}} \right| = {\text{tr}}\left( {\log \left( {\mathbf{M}} \right)} \right)$$

(1)

where $\text{log}\left(\mathbf{M}\right)$ is the matrix logarithm (for a matrix with an eigendecomposition $\mathbf{M}=\mathbf{U}\mathbf{D}{\mathbf{U}}^{\mathbf{^{\prime}}}$, $\text{log}\left(\mathbf{M}\right)=\mathbf{U}\text{log}\left(\mathbf{D}\right){\mathbf{U}}^{\mathbf{^{\prime}}}$). Then, Eq. (1) is approximated by the Monte Carlo-based Hutchinson estimator [12] as:

$${\text{tr}}\left( {\log \left( {\mathbf{M}} \right)} \right) \approx \frac{1}{s}\mathop \sum \limits_{i = 1}^{s} {\mathbf{v}}_{i}^{\prime} \log \left( {\mathbf{M}} \right){\mathbf{v}}_{i}$$

(2)

where $s$ is the number of samples and ${\mathbf{v}}_{i}$ is a random vector taking values 1 or − 1 with a probability equal to 0.5. In this expression, the product $\text{log}\left(\mathbf{M}\right){\mathbf{v}}_{i}$ does not require explicit computation of the matrix $\text{log}\left(\mathbf{M}\right)$. Instead, the product is approximated using the Method 2 of Hale et al. [2] by solving the following integral based on Cauchy’s integral formula [13; p. 119]:

$$\log \left( {\mathbf{M}} \right){\mathbf{v}}_{i} = \frac{{\mathbf{M}}}{2\pi i}\mathop \int \limits_{{\Gamma }} z^{ - 1} \log \left( z \right)\left( {z{\mathbf{I}} - {\mathbf{M}}} \right)^{ - 1} {\mathbf{v}}_{i} dz$$

(3)

where $\Gamma$ is a closed contour in the region of analyticity of the natural logarithm function (all complex numbers but those on the negative real axis and zero) and looping once around the spectrum of $\mathbf{M}$ in the counterclockwise direction. Evaluating Eq. (3) involves estimating the largest $\left({\lambda }_{max}\right)$ and smallest $\left({\lambda }_{min}\right)$ eigenvalues of $\mathbf{M}$, evaluating a complete elliptic integral of the first kind [13, 14] and solving a system of linear equations with a complex coefficient matrix. Figure 1 shows the algorithm for calculating Eq. (3). Note that $\mathbf{M}$ is not required in Fig. 1, but a method to multiply $\mathbf{M}$ times a vector without explicitly setting up $\mathbf{M}$.

Application details

We will focus the application on three kinds of matrices, which will be referred to indistinctly as $\mathbf{M}$:

${\mathbf{ZZ^{\prime}}}$ or ${\mathbf{Z^{\prime}Z}}$: GBLUP or SNP-BLUP matrices, where $\mathbf{Z}$ is the genotype content matrix [15, 16].
$\mathbf{H}$: single-step GBLUP relationship matrices [17].
$\mathbf{C}$: mixed-model equations matrices [18].

Let a linear mixed model be of the form $\mathbf{y}=\mathbf{X}\mathbf{b}+\mathbf{W}\mathbf{u}+\mathbf{e}$, where $\mathbf{y}$ represents the phenotypes, $\mathbf{X}\mathbf{b}$ the fixed effects, $\mathbf{W}\mathbf{u}$ comprises all random effects, and $\mathbf{e}$ is the error term. Assuming multivariate normality of $\mathbf{y}$ with $Var\left(\mathbf{u}\right)={\mathbf{G}}_{0}\otimes \mathbf{K}$, $Var\left(\mathbf{e}\right)=\mathbf{R}$, and $Var\left(\mathbf{y}\right)=\mathbf{V}=\mathbf{W}\left({\mathbf{G}}_{0}\otimes \mathbf{K}\right){\mathbf{W}}^{\prime}+\mathbf{R}$, the log-likelihood $\left(\text{log}L\right)$ and restricted log-likelihood $\left(\text{log}{L}_{r}\right)$ are:

$$- 2{\text{log}}L \propto \log \left| {\mathbf{R}} \right| + \log \left| {{\mathbf{G}}_{0} \otimes {\mathbf{K}}} \right| + \log \left| {\left( {{\mathbf{G}}_{0} \otimes {\mathbf{K}}} \right)^{ - 1} + {\mathbf{W}}^{\prime} {\mathbf{R}}^{ - 1} {\mathbf{W}}} \right| + {\mathbf{y}}^{\prime} {\mathbf{Py}}$$

(4)

$$- 2{\text{log}}L_{r} \propto \log \left| {\mathbf{R}} \right| + \log \left| {{\mathbf{G}}_{0} \otimes {\mathbf{K}}} \right| + \log \left| {\mathbf{C}} \right| + {\mathbf{y}}^{\prime} {\mathbf{Py}}$$

(5)

where a block of $\mathbf{K}$ could be equal to $\mathbf{Z}\mathbf{Z}\boldsymbol{^{\prime}}$ or $\mathbf{H}$, and $\mathbf{P}={\mathbf{V}}^{-1}-{\mathbf{V}}^{-1}\mathbf{X}{\left({\mathbf{X}}^{\mathbf{^{\prime}}}{\mathbf{V}}^{-1}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\mathbf{^{\prime}}}{\mathbf{V}}^{-1}$. The importance of approximating the log-determinant for the matrices above-mentioned follows from Eqs. (4) and (5).

Note that we do not cover the numerator relationship matrix (A) due to the simplicity of calculating its determinant (the product of Mendelian sampling variances). The key mathematical operation of the approximation method is the efficient multiplication of $\mathbf{M}$ times a suitable vector $\mathbf{x}$. The product ${\mathbf{ZZ^{\prime}x}}\;{\text{or}}\;{\mathbf{Z^{\prime}Zx}}$ can be obtained efficiently by storing $\mathbf{Z}$ as binary integers and performing a custom multiplication [10] or by more complex algorithms [19]. For single-step matrices, it is easier to calculate ${\mathbf{H}}^{-1}\mathbf{x}$ than $\mathbf{H}\mathbf{x}$ [20]; therefore, one should approximate $\text{log}\left|{\mathbf{H}}^{-1}\right|$, and then $\text{log}\left|\mathbf{H}\right|=-\text{log}\left|{\mathbf{H}}^{-1}\right|$. The efficient product ${\mathbf{H}}^{-1}\mathbf{x}$ could be carried out by different methods [8, 9, 21]. As pointed out by the associated editor, one could calculate $\text{log}\left|\mathbf{H}\right|=-\text{log}\left|{\mathbf{A}}^{11}\right|+\text{log}\left|\mathbf{G}\right|$, where ${\mathbf{A}}^{11}$ is the block of the inverse of $\mathbf{A}$ pertaining to the non-genotyped animals and $\mathbf{G}$ is the genomic relationship matrix. This decomposition yields from applying determinants of partitioned matrices to $\mathbf{H}=\left(\begin{array}{cc}{\left({\mathbf{A}}^{11}\right)}^{-1}& -{\left({\mathbf{A}}^{11}\right)}^{-1}{\mathbf{A}}^{12}\\ 0& \mathbf{I}\end{array}\right)\left(\begin{array}{cc}{\mathbf{A}}^{11}& 0\\ 0& \mathbf{G}\end{array}\right)\left(\begin{array}{cc}{\left({\mathbf{A}}^{11}\right)}^{-1}& 0\\ -{\mathbf{A}}^{21}{\left({\mathbf{A}}^{11}\right)}^{-1}& \mathbf{I}\end{array}\right)$. However, computing $\text{log}\left|\mathbf{G}\right|$ and $\text{log}\left|{\mathbf{A}}^{11}\right|$ in an exact way in large data sets is expensive. The latter, although sparse, needs to include all non-genotyped animals, which can be millions of individuals for many applications. Also, approximating separately $\text{log}\left|\mathbf{G}\right|$ and $\text{log}\left|{\mathbf{A}}^{11}\right|$ could be equally expensive as approximating $-\text{log}\left|{\mathbf{H}}^{-1}\right|$. Furthermore, for some iteration on data applications ${\mathbf{A}}^{11}$ might not be created (the product ${\mathbf{A}}^{-1}\mathbf{x}$ is done by reading the pedigree file). For those reasons we addressed directly the computation of $-\text{log}\left|{\mathbf{H}}^{-1}\right|$. Finally, the product $\mathbf{C}\mathbf{x}$ can be obtained with iteration on data [6, 7] plus the previous methods in case $\mathbf{C}$ includes genomic information.

For estimating the maximum and minimum eigenvalues of $\mathbf{M}$ (line 1 in Fig. 1), we use the Lanczos algorithm with full orthogonalization and implicit restarting [22; p. 562–570]. The Lanczos algorithm requires repeating $\mathbf{M}\mathbf{x}$ for some iterations to reduce $\mathbf{M}$ to a tridiagonal matrix $\mathbf{T}$. Then, the eigendecomposition of $\mathbf{T}$ is performed using the Intel MKL stev subroutine [23], and its maximum and minimum eigenvalues are an estimate of ${\lambda }_{max}$ and ${\lambda }_{min}$, respectively. The Lanczos algorithm is suitable for any symmetric matrix $\mathbf{M}$.

The evaluation of the complete integral of the first kind (line 4 in Fig. 1) and the Jacobi elliptic functions (line 8 in Fig. 1) was done by adapting and rewriting the algorithms of the ellipkkp and the ellipjc functions from Driscolls’ Schwarz-Christofel Toolbox [24].

The linear system $\left({w}^{2}\mathbf{I}-\mathbf{M}\right)\mathbf{z}={\mathbf{v}}_{i}$ needs to be solved in each iteration when approximating $\text{log}\left(\mathbf{M}\right){\mathbf{v}}_{i}$, where ${w}^{2}$ is a complex number (line 11 in Fig. 1). Since $\left({w}^{2}\mathbf{I}-\mathbf{M}\right)$ is non-Hermitian but symmetric, we chose the preconditioned Conjugate Orthogonal Conjugate Gradient method (COCG) [25], which involves one complex $\mathbf{M}\mathbf{x}$ product in each iteration.

The accuracy and speed of the approximation depend on the following parameters: number of Monte Carlo samples $\left({n}_{samples}\right)$, number of Lanczos iterations $\left({n}_{lanczos}\right)$, number of iterations for the integral approximation $\left({n}_{int}\right)$, and number of COCG iterations within each iteration for the integral approximation $\left({n}_{cocg}\right)$. As default, we set ${n}_{lanczos}, {n}_{samples, } {n}_{int, }\text{ and }{n}_{cocg}$ to 200, 5, 10, and 20, respectively. These values were empirically chosen based on a sensitivity analysis.

It is worth noticing that in animal and plant breeding many of the involved matrices are non-full rank. Examples are the incidence matrix of all linear models with more than one cross-classified effect; relationship (or covariance) matrices due to the presence of clones or identical individuals; non-blended genomic relationships when there are more individuals than loci. Therefore, the determinant of those matrices is zero and its log-determinant is not defined. For those cases, the log-likelihood requires the logarithm of the pseudo-determinant of $\mathbf{M}$ as $\text{log}\left|\mathbf{M}\right|=\text{log}\left|\mathbf{L}\mathbf{D}\mathbf{L}\mathbf{^{\prime}}\right|={\sum }_{i}{1}_{{d}_{ii}>0}\text{log}\left({d}_{ii}\right)$, where $\mathbf{D}$ is a diagonal matrix, and ${1}_{{d}_{ii}>0}$ is an indicator function that equals to one if ${d}_{ii}>0$ and zero otherwise [26; p. 527–528].

All algorithms were written in Fortran 90 and implemented in the Blupf90 software suite [27]. Computations were performed on a Dell PowerEdge R740XD server with 1.5 TB of memory, 45 TB of disk, and two Intel Xeon Gold 6258R processors, using four threads for the calculations. The sparse matrix computations, used as a benchmark, were carried out with the YAMS package [28].

Materials

We tested the presented algorithm with two examples. In the first one, we show the efficiency of the method for approximating $\text{log}\left|\mathbf{C}\right|$ and $\text{log}\left|{\mathbf{H}}^{-1}\right|$ for various datasets. Also, the sensitivity of the method in terms of ${n}_{lanczos}$, ${n}_{samples}$, ${n}_{int}$, and ${n}_{cocg}$ was tested.

In the second example, we compared the exact and approximated restricted log-likelihood for a post-weaning gain (PWG) model with data provided by the American Angus Association (St. Joseph, MI). We also estimated variance components using two derivative-free methods with approximated log-likelihood and compared them against EM and AI-REML for the same dataset. The derivative-free method was used to check if the likelihood surface was correctly explored using the numerical approximation, i.e. if we arrived to the same optimum as using EM or AI-REML. It is worth noticing that EM and AI-REML do not use the approximation of the log-determinant presented in this work, because they optimize through first (and second in AI-) derivatives obtained by the Cholesky factor of $\mathbf{C}$, which also allows exact computation of the likelihood.

Example 1

We compared the log-determinant approximation versus its exact counterpart using four datasets, including dairy and beef cattle, chicken, and pigs. These datasets were already used and described in [1]. Table 1 shows the description of the datasets and the models used, including the number of equations, number of non-zero elements, and rank for $\mathbf{C}$ and ${\mathbf{H}}^{-1}$. Log-determinants were evaluated for previously estimated variance components. Also, the sensitivity of the method in terms of the number of samples and iterations for various steps was assessed.

Table 1 Description of the different datasets used to test the approximation of the log-determinant of the matrices

Full size table

Example 2

We used the log-determinant approximation method to calculate an approximated log-likelihood for a PWG model. Then, the approximated log-likelihood was compared to the exact log-likelihood for a range of heritabilities. The dataset had 169,621 animals in the pedigree, from which 76,758 had records, and 10,000 had available genotypes. The effects in the model were a contemporary group effect with 31,368 levels and the additive genetic effect. This dataset was also used for comparing a derivative-free REML with approximated log-likelihoods against EM and AI-REML. We chose Powell’s method [29] and Algorithm AS 319 [30] to minimize $-2\text{log}{L}_{r}$ (see Eq. (5)) and perform derivative-free REML.

Results and discussion

Table 2 shows the accuracy and computing time for approximating the log-determinant of different matrices. The absolute relative difference between the approximated and exact log-determinant averaged 7.85 × 10^–3, regardless whether the matrix was full rank or not. As expected, the approximation was slower than the exact method, which used sparse matrix techniques, when the dimension of the matrix was small or moderate (e.g., in the order of hundreds of thousands). However, for large-size problems with dimensions greater than a million, the approximation was 2 to 500 times faster than the exact method. The average empirical standard error of the approximations was 0.23% of the estimated value, except for $\mathbf{C}$ in dataset 3, where it represented 23.74% of the approximated log-determinant. This could possibly be caused by the high condition number of such $\mathbf{C}$ compared with the other matrices.

Table 2 Approximation¹ of the log-determinant of $\text{C}$ and $\text{H}$ for the different datasets

Full size table

As mentioned in “Application details” section, the accuracy and computing performance of the approximation depend on the following parameters: number of Monte Carlo samples $\left({n}_{samples}\right)$, number of Lanczos iterations $\left({n}_{lanczos}\right)$, number of iterations for the integral approximation $\left({n}_{int}\right)$, and number of COCG iterations within each iteration for the integral approximation $\left({n}_{cocg}\right)$. For computing purposes, it is desirable to keep those parameters to their minimum without compromising the accuracy of the approximation. The required time for approximating $\text{log}\left|\mathbf{M}\right|$ would be approximately $\left({n}_{lanczos}+{n}_{samples}{n}_{int}{n}_{cocg}\right)t$, where $t$ is the time taken by the multiplication of $\mathbf{M}$ times a vector. Figures 2, 3, 4, 5 show how the log-determinant approximation for $\mathbf{C}$ varies for each parameter while keeping the rest fixed to their default values.

For some datasets, the exact parameter is not covered by the empirical 95% confidence interval built from its estimator. The impact of this bias should be evaluated depending on the purpose of the log-determinant approximation. Previous uses of Hutchinson estimators in animal breeding were for the approximation of the trace of the product of two matrices for variance components estimation with REML [31,32,33,34], but not for computing likelihoods. However, in their studies, the authors did not compare the estimates of the traces against the true values. Therefore, any hypothesis regarding the bias in the approximation of the log-det should be assessed in future studies.

The number of Monte Carlo samples $({n}_{samples})$ employed in our study seems extremely small when comparing the proposed Monte Carlo estimator against MCMC sampling for variance components estimation. However, such small numbers were also successfully used to approximate the trace of the product of matrices for Monte Carlo REML [32,33,34]. The usage of small numbers of Monte Carlo samples for Hutchinson estimators arises from the low autocorrelation between successive samples which is a fact that does not occur when estimating variance components by MCMC methods.

The results for Example 2 are presented in Fig. 6 and Table 3. Regardless of the difference between the exact and approximated likelihood in Fig. 6, it can be observed that both have a minimum at a heritability equal to 0.18, which is the value obtained when estimating variance components with EM or AI-REML. However, for estimating variance components with approximated log-determinant, only the DF-REML with Powell’s method [29] converged to the same heritability as EM and AI-REML. Although Algorithm AS-319 [30] was the fastest method, it overestimated the heritability by five points (0.23 vs. 0.18). This overestimation was also observed when the exact log-determinants were used in place of the approximated ones (results not shown).

Table 3 Comparison between expectation maximization (EM), average information (AI), and approximated derivative-free (aDF) REML for post-weaning gain in American Angus cattle

Full size table

Typically, variance components estimation with large datasets is done by subsetting the original dataset. This practice could be problematic, for example, when the estimation of the correlation between two traits is sensitive to the history of selection on either one. In such a case, selecting a data subset that includes several generations of data is necessary to consider selection properly. This could generate a data subset that could still be very large for applying REML or MCMC methods. Furthermore, although approximated methods like the one presented in this study present numerical errors, they are usually based on mathematical and statistical properties and are easier to measure than errors arising from data subsetting.

The methods presented in this study allow likelihood-based inferences for large populations like the ones used in animal and plant breeding. Some, but not all, possible uses include parametric model validation like Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC), derivative-free algorithms for estimating variance components [35], calculating likelihood values to assess the convergence of Monte Carlo REML [34], and estimating relationships across metafounders without truncating the likelihood [36]. Although derivative-free methods are currently not used, the most popular used algorithms dated from the’60s [29, 37]; hence, new derivative-free algorithms with better convergence properties than older ones could be a future research topic. These new algorithms, coupled with log-determinant approximations could provide a computationally attractive way of estimating variance components where memory and running time make REML and MCMC methods unfeasible to apply. Current use of genomic selection and large size of historical datasets is leading to this situation. For instance, national dairy cattle evaluations have hundreds of millions of equations and more than one million genotyped animals [38, 39]. In plant breeding, the models have more traits but less equations and genotyped individuals than in animal breeding. In such setup, the computational cost of the methods used in plant breeding to address such peculiarity do not scale well when increasing the number of genotypes [40,41,42].

The exact calculation of log-determinant should still be preferred whenever possible. In cases where the exact calculation is not possible, it is necessary to note that errors due to the random sampling of the Hutchinson estimator occur during the approximation process. Even though the estimator is unbiased, the Monte Carlo error could cause bias in estimated parameters, especially when iterative algorithms are close to convergence. Previous studies have shown that reusing the same sequence of random numbers for each round of the Hutchinson estimator reduced the noise of the parameter’s estimates and the number of rounds until convergence [34]. Therefore, this might be a possible way of dealing with the statistical noise of the approximation of the log-determinant when using it with Monte Carlo REML or derivative-free algorithms. Furthermore, it would make sense to apply this technique when comparing models based on parametric statistics or for hypothesis testing so the random noise in different approximations is canceled out.

Conclusions

We brought from other fields a method to approximate the log-determinant of a positive semi-definite matrix and adapted it to the field of animal breeding and genetics and, consequently, to plant breeding. The algorithm relies on successive matrix–vector multiplications, which can be done for linear mixed models in quantitative genetics due to custom efficient algorithms employed in routine genetic evaluations. The method can be applied to matrices of any size and could be used for Monte Carlo REML, derivate-free REML, parametric validation, and hypothesis testing. Approximating the log-determinant introduces random noise due to replacing an exact value with an approximated one. This error should be considered when using the proposed methodology.

Data availability

Data and material are available upon request.

References

Masuda Y, Aguilar I, Tsuruta S, Misztal I. Technical note: acceleration of sparse operations for average-information REML analyses with supernodal methods and sparse-storage refinements. J Anim Sci. 2015;93:4670–4.
Article CAS PubMed Google Scholar
Hale N, Higham N, Trefethen L. Computing aAlpha log(A) and related matrix functions by contour integrals. SIAM J Numer Anal. 2008;46:2505–23.
Article Google Scholar
Aune E, Simpson DP, Eidsvik J. Parameter estimation in high dimensional Gaussian distributions. Stat Comput. 2014;24:247–63.
Article Google Scholar
Schaeffer LR, Kennedy BW. Computing strategies for solving mixed model equations. J Dairy Sci. 1986;69:575–9.
Article Google Scholar
Misztal I, Gianola D. Indirect solution of mixed model equations. J Dairy Sci. 1987;70:716–23.
Article Google Scholar
Strandén I, Lidauer M. Solving large mixed linear models using preconditioned conjugate gradient iteration. J Dairy Sci. 1999;82:2779–87.
Article PubMed Google Scholar
Tsuruta S, Misztal I, Strandén I. Use of the preconditioned conjugate gradient algorithm as a generic solver for mixed-model equations in animal breeding applications. J Anim Sci. 2001;79:1166–72.
Article CAS PubMed Google Scholar
Misztal I. Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics. 2016;202:401–9.
Article CAS PubMed Google Scholar
Masuda Y, Misztal I, Legarra A, Tsuruta S, Lourenco DA, Fragomeni BO, et al. Technical note: avoiding the direct inversion of the numerator relationship matrix for genotyped animals in single-step genomic best linear unbiased prediction solved with the preconditioned conjugate gradient. J Anim Sci. 2017;95:49–52.
CAS PubMed Google Scholar
Vandenplas J, Eding H, Bosmans M, Calus MPL. Computational strategies for the preconditioned conjugate gradient method applied to ssSNPBLUP with an application to a multivariate maternal model. Genet Sel Evol. 2020;52:24.
Article PubMed PubMed Central Google Scholar
Harville DA. Matrix algebra from a statistician’s perspective. New York: Springer; 2008.
Google Scholar
Hutchinson MF. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun Stat Simul Comput. 1990;19:433–50.
Article Google Scholar
Ahlfors LV. Complex analysis: an introduction to the theory of analytic functions of one complex variable. 3rd ed. New York: McGraw-Hill; 1979.
Google Scholar
Abramowitz M, Stegun IA. Handbook of mathematical functions. New York: Dover; 1972.
Google Scholar
Van Raden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.
Article Google Scholar
Strandén I, Garrick DJ. Technical note: derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci. 2009;92:2971–5.
Article PubMed Google Scholar
Legarra A, Aguilar I, Misztal I. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009;92:4656–63.
Article CAS PubMed Google Scholar
Henderson CR. Applications of linear models in animal breeding. Guelph: University of Guelph; 1984.
Google Scholar
Freudenberg A, Vandenplas J, Schlather M, Pook T, Evans R, Ten Napel J. Accelerated matrix-vector multiplications for matrices involving genotype covariates with applications in genomic prediction. Front Genet. 2023;14:1220408.
Article PubMed PubMed Central Google Scholar
Colleau JJ, Palhière I, Rodríguez-Ramilo ST, Legarra A. A fast indirect method to compute functions of genomic relationships concerning genotyped and ungenotyped individuals for diversity management. Genet Sel Evol. 2017;49:87.
Article PubMed PubMed Central Google Scholar
Mäntysaari EA, Evans RD, Strandén I. Efficient single-step genomic evaluation for a multibreed beef cattle population having many genotyped animals. J Anim Sci. 2017;95:4728–37.
Article PubMed PubMed Central Google Scholar
Golub GH, Van Loan CF. Matrix computations. 4th ed. Baltimore: Johns Hopkins University Press; 2015.
Google Scholar
Intel(R) oneAPI Math Kernel Library Version 2023.2; 2023.
Driscoll T. SC Toolbox: a user’s guide; 2005.
Van der Vorst HA, Melissen JBM. A petrov-galerkin type method for solving Axk=b where A is symmetric complex. IEEE Trans Magn. 1990;26:706–8.
Article Google Scholar
Rao CR. Linear statistical inference and its applications. New York: Wiley; 1973.
Book Google Scholar
Lourenco D, Tsuruta S, Masuda Y, Bermann M, Legarra A, Misztal I. Recent updates in the BLUPF90 software suite. In: Proceedings of the 12th world congress on genetics applied to livestock production; 3–8 July 2022; Rotterdam; 2022.
Masuda Y, Baba T, Suzuki M. Application of supernodal sparse factorization and inversion to the estimation of (co)variance components by residual maximum likelihood. J Anim Breed Genet. 2014;131:227–36.
Article CAS PubMed Google Scholar
Powell MJD. An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput J. 1964;7:155–62.
Article Google Scholar
Koval JJ. Algorithm AS 319: variable metric function minimization. J R Stat Soc Series C Appl Stat. 1997;46:515–21.
Article Google Scholar
García-Cortés LA, Sorensen D. Alternative implementations of Monte Carlo EM algorithms for likelihood inferences. Genet Sel Evol. 2001;33:443.
Article PubMed PubMed Central Google Scholar
Matilainen K, Mäntysaari EA, Lidauer MH, Strandén I, Thompson R. Employing a Monte Carlo algorithm in expectation maximization restricted maximum likelihood estimation of the linear mixed model. J Anim Breed Genet. 2012;129:457–68.
Article CAS PubMed Google Scholar
Matilainen K, Mäntysaari EA, Lidauer MH, Strandén I, Thompson R. Employing a Monte Carlo algorithm in Newton-type methods for restricted maximum likelihood estimation of genetic parameters. PLoS ONE. 2013;8: e80821.
Article PubMed PubMed Central Google Scholar
Matilainen K, Mäntysaari EA, Strandén I. Efficient monte carlo algorithm for restricted maximum likelihood estimation of genetic parameters. J Anim Breed Genet. 2019;136:252–61.
Article PubMed Google Scholar
Meyer K. Restricted maximum likelihood to estimate variance components for animal models with several random effects using a derivative-free algorithm. Genet Sel Evol. 1989;21:317–40.
Article PubMed Central Google Scholar
Legarra A, Bermann M, Mei Q, Christensen OF. Estimating genomic relationships of metafounders across and within breeds using maximum likelihood pseudo-expectation-maximization maximum likelihood and increase of relationships. Genet Sel Evol. 2024;56:35.
Article CAS PubMed PubMed Central Google Scholar
Nelder JA, Mead R. A simplex method for function minimization. Comput J. 1965;7:308–13.
Article Google Scholar
Alkhoder H, Liu Z, Reents R. The marker effects of a single-step random regression model for 4 test-day traits in German Holsteins. J Dairy Sci. 2024;107:423–37.
Article CAS PubMed Google Scholar
Cesarani A, Lourenco D, Bermann M, Nicolazzi EL, VanRaden PM, Misztal I. Single-step genomic predictions for crossbred Holstein and Jersey cattle in the United States. JDS Commun. 2023;5:124–8.
Article PubMed PubMed Central Google Scholar
Runcie DE, Qu J, Cheng H, Crawford L. MegaLMM: mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biol. 2021;22:213.
Article PubMed PubMed Central Google Scholar
Montesinos-López OA, Montesinos-López A, Crossa J, Kismiantini, Ramírez-Alcaraz JM, Singh R, et al. A singular value decomposition Bayesian multiple-trait and multiple-environment genomic model. Heredity. 2019;122:381–401.
Article PubMed Google Scholar
Xavier A, Runcie D, Habier D. Megavariate methods capture complex genotype-by-environment interactions. Genetics. 2025;229:iyae179.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We thank the American Angus Association, Cobb-Vantress, the US Holstein Association, and the Pig Improvement Company (PIC) for providing the data for this and previous studies.

Funding

This study was funded by Agriculture and Food Research Initiative Competitive Grant no. 2024-67015-42255 from the US Department of Agriculture’s National Institute of Food and Agriculture.

Author information

Authors and Affiliations

Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA
Matias Bermann, Alejandra Alvarez-Munera, Ignacy Misztal & Daniela Lourenco
CDCB, Bowie, MD, 20716, USA
Andres Legarra
Instituto Nacional de Investigación Agropecuaria (INIA), 11500, Montevideo, Uruguay
Ignacio Aguilar

Authors

Matias Bermann
View author publications
Search author on:PubMed Google Scholar
Alejandra Alvarez-Munera
View author publications
Search author on:PubMed Google Scholar
Andres Legarra
View author publications
Search author on:PubMed Google Scholar
Ignacio Aguilar
View author publications
Search author on:PubMed Google Scholar
Ignacy Misztal
View author publications
Search author on:PubMed Google Scholar
Daniela Lourenco
View author publications
Search author on:PubMed Google Scholar

Contributions

MB conceived the study, run the analysis, and wrote the manuscript. MB and AAM adapted and implemented the algorithms. AL, IA, DL, and IM edited the manuscript. All the authors read and approved the final manuscript.

Corresponding author

Correspondence to Matias Bermann.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Bermann, M., Alvarez-Munera, A., Legarra, A. et al. Monte Carlo approximation of the logarithm of the determinant of large matrices with applications for linear mixed models in quantitative genetics. Genet Sel Evol 57, 44 (2025). https://doi.org/10.1186/s12711-025-00991-1

Download citation

Received: 12 December 2024
Accepted: 30 July 2025
Published: 06 August 2025
DOI: https://doi.org/10.1186/s12711-025-00991-1

Monte Carlo approximation of the logarithm of the determinant of large matrices with applications for linear mixed models in quantitative genetics

Abstract

Background

Results

Conclusions

Background

Methods

Theory

Application details

Materials

Example 1

Example 2

Results and discussion

Conclusions

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Genetics Selection Evolution

Contact us