Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little’s MCAR test)

  • Original Manuscript
Behavior Research Methods

Abstract

The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little’s MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of said auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improvement of practical methods for handling missing data in statistical analyses.

Data availability

Reiterating the open practices statement above, all simulation files and worked example code are available on an Open Science Framework Repository at: https://osf.io/q84ts/.

Notes

  1. Although extensions of both FIML and multiple imputation have been developed to handle MNAR missing data, we refer throughout the paper to the more widely known and used MAR-based versions of these methods. For example, the simulation reported later in the paper invokes FIML estimation under missing data by setting the arguments missing = "FIML" and fixed.x = "TRUE" in the lavaan package in R.

  2. Note that the goal of satisfying the MAR assumption is aspirational but unverifiable in practice: in real datasets, researchers can never be certain that (a) they have identified true causes, as opposed to correlates, of missing data; (b) they have identified all such causes of missingness and all are measured and available in the dataset; and (c) missing values are not additionally caused by participants’ unseen scores on the variables in question, in which case the missing data mechanism would be MNAR. In other words, researchers can never be certain that the MAR assumption is (fully) met; rather, researchers can only render MAR more plausible by searching for and including useful auxiliary variables in analysis. In practice, researchers can never distinguish between MAR and MNAR mechanisms, as doing so would require access to participants’ unseen (missing) scores on all variables with missing data.

  3. Our collective experience collaborating with and providing statistical consultation for numerous substantive and applied researchers has led us to the firm conviction that successful convergence of complex multiple imputation models is by no means a foregone conclusion, especially when models incorporate complexities such as those listed above. The definition of “successful convergence” for multiple imputation is crucial to this conclusion. While on the user end one may achieve successful results with no warning message in most software packages, investigation of recommended imputation diagnostics might demonstrate untrustworthy performance (see, e.g., Enders, 2022; Hayes & Enders, 2023).

  4. Unless the researcher has decisive reasons to believe that the data are MCAR, such as when missing data are caused by a lab computer periodically crashing in a haphazard manner unrelated to participants’ characteristics or when the researcher has used a planned missing data design to purposefully inject MCAR missing data.

  5. Alternatively, the researcher might include all substantive model variables as well, e.g.,

    $$\text{ln}\left(\frac{{\widehat{p}}_{Miss}}{1-{\widehat{p}}_{Miss}}\right)={b}_{0}+{b}_{1}x+{b}_{2}{a}_{1}+{b}_{3}{a}_{2}+{b}_{4}{a}_{3}$$

    which would allow the researcher to assess whether candidate auxiliary variables \({a}_{1}\), \({a}_{2}\), and \({a}_{3}\) predicted missing data above and beyond the variable(s) in the substantive model (i.e., x, smoking attitudes, in the hypothetical example).

  6. Admittedly, this poses no shortcoming when assessing the types of inherently parabolic convex missing mechanisms under specific consideration in the present study, but may hinder generalizations to other, thornier, less orthodox functional forms of the relationship between auxiliary variables and missing data indicators.

  7. Note that this implies that the permutation importance test was conducted using marginal rather than partial variable importance, as described by Strobl et al. (2020). Based on pilot simulations, this procedure performed substantially better than partial variable importance measures. Because our goal here was not a detailed comparison of these options, however, we do not discuss partial importance measures further.

  8. Note that we also ran a set of analyses that included no auxiliary variables and that estimated the model using listwise deletion rather than FIML, using argument missing = "listwise" in lavaan. Because the results of these listwise analyses were identical to those of the “no auxiliary variable” FIML analyses, we opted to conserve space by omitting them from our presentation here.

  9. This can be said of the interactive mechanism here because it was designed to mimic the effects of a convex functional form, despite missing data rates depending on the values of two, rather than just one, auxiliary variables.
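
As a rough illustration of notes 5, 6, and 9, the sketch below simulates a convex MAR mechanism and fits the missingness logistic regression twice: once with only a linear term for the auxiliary variable and once with an added squared term. It is written in Python rather than the R used in the paper, relies on a hand-rolled Newton-Raphson fitter so that it needs only the standard library, and is not the authors' simulation code (that is in the OSF repository). The sample size, the coefficients -2.0 and 1.0, the single auxiliary variable `a`, and the helper names `solve` and `fit_logistic` are all assumptions made for illustration.

```python
import math
import random


def solve(A, b):
    """Solve the small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_logistic(rows, miss, iters=25):
    """Newton-Raphson logistic regression; returns [b0, b1, ...] in log-odds units."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    b = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, m in zip(X, miss):
            eta = sum(bj * xj for bj, xj in zip(b, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (m - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        b = [bj + sj for bj, sj in zip(b, step)]
    return b


random.seed(1)
n = 4000  # assumed sample size, chosen only for a stable illustration
a = [random.gauss(0, 1) for _ in range(n)]

# Hypothetical convex MAR mechanism: the log-odds of missingness rise with a**2,
# so cases at both extremes of the auxiliary variable are more likely to be missing.
p_miss = [1.0 / (1.0 + math.exp(-(-2.0 + 1.0 * ai ** 2))) for ai in a]
miss = [1 if random.random() < p else 0 for p in p_miss]

linear = fit_logistic([[ai] for ai in a], miss)              # [b0, b_a]
quadratic = fit_logistic([[ai, ai ** 2] for ai in a], miss)  # [b0, b_a, b_a2]

print("linear-only slope for a:", round(linear[1], 3))
print("squared-term coefficient:", round(quadratic[2], 3))
```

Because this mechanism is symmetric around zero, the linear slope should land near zero even though `a` strongly drives missingness, while the squared term should recover a clearly positive coefficient. This is the kind of pattern that a linear-only logistic regression or a t-test would be expected to miss.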

Funding

No funding was used to support this research.

Author information

Corresponding author

Correspondence to Timothy Hayes.

Ethics declarations

Conflicts of interest

The authors have no conflicts of interest to disclose. 

Ethics approval, Informed consent, and Consent for publication

Not applicable for the simulated data used in the paper (no human subjects participated in this theoretical, simulation research).

Open practices statement

All simulation files and worked example code are available on an Open Science Framework Repository at: https://osf.io/q84ts/.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 105 KB)

Supplementary file2 (PPTX 136 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hayes, T., Baraldi, A.N. & Coxe, S. Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little’s MCAR test). Behav Res 56, 8608–8639 (2024). https://doi.org/10.3758/s13428-024-02494-1

  • DOI: https://doi.org/10.3758/s13428-024-02494-1
