Abstract
The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little’s MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of said auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improvement of practical methods for handling missing data in statistical analyses.






Data availability
Reiterating the open practices statement above, all simulation files and worked example code are available on an Open Science Framework Repository at: https://osf.io/q84ts/.
Notes
We note that although extensions of both FIML and multiple imputation have been developed to handle MNAR missing data, we refer throughout the paper to the more widely known and used MAR-based versions of these methods (e.g., invoking FIML estimation under missing data by setting the arguments missing = "FIML" and fixed.x = TRUE in the lavaan package in R, as in the simulation reported later in the paper).
Note that the goal of satisfying the MAR assumption is aspirational but unverifiable in practice: in real datasets, researchers can never be certain that (a) they have identified true causes, as opposed to correlates, of missing data; (b) they have identified all such causes of missingness and all are measured and available in the dataset; and (c) missing values are not additionally caused by participants’ unseen scores on the variables in question, resulting in an analysis satisfying the MNAR mechanism. In other words, researchers can never be certain that the MAR assumption is (fully) met; rather, researchers can only render MAR more plausible by searching for and including useful auxiliary variables in analysis. In practice, researchers can never distinguish between MAR and MNAR mechanisms, as doing so would require access to participants’ unseen (missing) scores on all variables with missing data.
Our collective experience collaborating with and providing statistical consultation for numerous substantive and applied researchers has led us to the firm conviction that successful convergence of complex multiple imputation models is by no means a foregone conclusion, especially when models incorporate complexities such as those listed above. The definition of "successful convergence" for multiple imputation is crucial to this conclusion: although most software packages may return results with no warning message, inspection of recommended imputation diagnostics can nonetheless reveal untrustworthy performance (see, e.g., Enders, 2022; Hayes & Enders, 2023).
Unless the researcher has decisive reasons to believe that the data are MCAR, such as when missing data are caused by a lab computer periodically crashing in a haphazard manner unrelated to participants’ characteristics or when the researcher has used a planned missing data design to purposefully inject MCAR missing data.
Alternatively, the researcher might include all substantive model variables as well, e.g.,
$$\ln\left(\frac{{\widehat{p}}_{Miss}}{1-{\widehat{p}}_{Miss}}\right)={b}_{0}+{b}_{1}x+{b}_{2}{a}_{1}+{b}_{3}{a}_{2}+{b}_{4}{a}_{3}$$

which would allow the researcher to assess whether candidate auxiliary variables \({a}_{1}\), \({a}_{2}\), and \({a}_{3}\) predict missing data above and beyond the variable(s) in the substantive model (i.e., x, smoking attitudes, in the hypothetical example).
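A missingness-screening regression of this kind can be sketched in Python with numpy alone. The code below is an illustrative sketch, not the paper's analysis: the simulated variables (x, a1–a3), the coefficient values, and the MAR mechanism (missingness driven only by a1) are all hypothetical assumptions chosen for demonstration, and the logistic fit is a plain Newton-Raphson (IRLS) routine rather than any particular statistical package.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson (IRLS).
    X: (n, p) design matrix including an intercept column; y: 0/1 outcome."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted P(miss = 1)
        w = p * (1.0 - p)                     # diagonal of the IRLS weight matrix
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
n = 5000
x  = rng.normal(size=n)   # substantive model variable (e.g., smoking attitudes)
a1 = rng.normal(size=n)   # candidate auxiliary variables (hypothetical)
a2 = rng.normal(size=n)
a3 = rng.normal(size=n)

# Hypothetical MAR mechanism: only a1 drives the probability of missingness
p_miss = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * a1)))
miss = rng.binomial(1, p_miss)

# Missingness indicator regressed on x and all candidate auxiliary variables
X = np.column_stack([np.ones(n), x, a1, a2, a3])
b0, b1, b2, b3, b4 = fit_logistic(X, miss)
# b2 (the slope for a1) should be clearly nonzero; b1, b3, b4 should be near zero
```

With a large sample, only the slope for a1 is appreciably nonzero, which is the pattern a researcher would look for when deciding which candidate auxiliary variables predict missingness beyond the substantive model variable(s).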
Admittedly, this poses no limitation when assessing the types of inherently parabolic, convex missingness mechanisms under specific consideration in the present study, but it may hinder generalization to other, thornier, less orthodox functional forms of the relationship between auxiliary variables and missing data indicators.
Note that this implies that the permutation importance test was conducted using marginal rather than partial variable importance, as described by Strobl et al. (2020). Based on pilot simulations, this procedure performed substantially better than partial variable importance measures. Because our goal here was not a detailed comparison of these options, however, we do not discuss partial importance measures further.
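The marginal permutation importance idea referenced here is model-agnostic and can be sketched compactly. The code below is an illustrative sketch under stated assumptions: the single-split "stump" classifier and the toy data are stand-ins invented for demonstration, not the conditional-inference forests or importance tests used in the paper.

```python
import numpy as np

def marginal_permutation_importance(predict, X, y, rng, n_rep=20):
    """Marginal permutation importance: the average drop in accuracy when one
    predictor column is shuffled, breaking its association with the outcome
    and with all other predictors (unlike partial/conditional importance,
    which permutes within strata of correlated predictors)."""
    base = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_rep):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(base - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 2))
# Outcome driven by column 0 only, with 10% label noise
y = ((X[:, 0] > 0) ^ (rng.random(n) < 0.10)).astype(int)

# Toy stand-in for a fitted classifier: a single split on column 0
stump = lambda M: (M[:, 0] > 0).astype(int)

imp = marginal_permutation_importance(stump, X, y, rng)
# imp[0] is large; imp[1] is exactly zero because the stump ignores column 1
```

Because the column is permuted unconditionally, this measures each variable's total (marginal) contribution, which is the property that made it effective for flagging auxiliary variables related to missingness in the simulations.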
Note that we also ran a set of analyses that included no auxiliary variables and that estimated the model using listwise deletion rather than FIML, using argument missing = “listwise” in lavaan. Because the results of these listwise analyses were identical to those of the “no auxiliary variable” FIML analyses, we opted to conserve space by omitting them from our presentation here.
This can be said of the interactive mechanism here because it was designed to mimic the effects of a convex functional form, despite missing data rates depending on the values of two auxiliary variables rather than just one.
References
Arbuckle, J. L. (1996). Full information estimation in the presence of incomplete data. In Advanced structural equation modeling (pp. 243–277). Lawrence Erlbaum Associates, Inc.
Berk, R. A. (2009). Statistical learning from a regression perspective. Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/BF00058655
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.
Cohen, J., Cohen, P., Aiken, L. S., & West, S. G. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd Ed.). Lawrence Erlbaum Associates, Inc.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351. https://doi.org/10.1037/1082-989X.6.4.330
Debeer, D., Hothorn, T., & Strobl, C. (2021). permimp: Conditional Permutation Importance (R package version 1.0–2). https://CRAN.R-project.org/package=permimp
Debeer, D., & Strobl, C. (2020). Conditional permutation importance revisited. BMC Bioinformatics, 21(1), 307. https://doi.org/10.1186/s12859-020-03622-2
Dixon, W. J. (1988). BMDP statistical software. University of California Press.
Enders, C. K. (2022). Applied missing data analysis (2nd Ed.). The Guilford Press.
Enders, C. K. (2023). Fitting structural equation models with missing data. In Handbook of structural equation modeling (2nd Ed., pp. 223–240). The Guilford Press.
Enders, C. K., Du, H., & Keller, B. T. (2020). A model-based imputation procedure for multilevel regression models with random coefficients, interaction effects, and nonlinear terms. Psychological Methods, 25(1), 88–112. https://doi.org/10.1037/met0000228
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 10(1), 80–100. https://doi.org/10.1207/S15328007SEM1001_4
Grund, S., Lüdtke, O., & Robitzsch, A. (2021). Multiple imputation of missing data in multilevel models with the R package mdmb: A flexible sequential modeling approach. Behavior Research Methods. https://doi.org/10.3758/s13428-020-01530-0
Hapfelmeier, A., Hothorn, T., Ulm, K., & Strobl, C. (2014). A new variable importance measure for random forests with missing data. Statistics and Computing, 24(1), 21–34. https://doi.org/10.1007/s11222-012-9349-1
Hapfelmeier, A., & Ulm, K. (2013). A new variable selection approach using Random Forests. Computational Statistics & Data Analysis, 60, 50–69. https://doi.org/10.1016/J.CSDA.2012.09.020
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning. Springer-Verlag.
Hayes, T., & Enders, C. K. (2023). Maximum likelihood and multiple imputation missing data handling: How they work, and how to make them work in practice. In H. Cooper, A. Panter, D. Rindskopf, K. J. Sher, M. Coutanche, & L. McMullen (Eds.), APA handbook of research methods in psychology (2nd Ed.). American Psychological Association.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. https://doi.org/10.2307/1267351
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. https://doi.org/10.1198/106186006X133933
IBM Corp. (2022). IBM SPSS Statistics for Macintosh, Version 29.0. IBM Corp.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd Ed.). Springer.
Jamshidian, M., & Jalal, S. (2010). Tests of Homoscedasticity, Normality, and Missing Completely at Random for Incomplete Multivariate Data. Psychometrika, 75(4), 649–674. https://doi.org/10.1007/s11336-010-9175-3
Jamshidian, M., Jalal, S., & Jansen, C. (2014). MissMech: An R package for testing homoscedasticity, multivariate normality, and missing completely at random (MCAR). Journal of Statistical Software, 56(6), 1–31. https://doi.org/10.18637/jss.v056.i06
Jeliĉić, H., Phelps, E., & Lerner, R. M. (2009). Use of missing data methods in longitudinal studies: The persistence of bad practices in developmental psychology. Developmental Psychology, 45(4), 1195–1199. https://doi.org/10.1037/a0015665
Kim, K. H., & Bentler, P. M. (2002). Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika, 67(4), 609–624. https://doi.org/10.1007/BF02295134
Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1–13. https://doi.org/10.18637/jss.v036.i11
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 12–22.
Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198–1202. https://doi.org/10.1080/01621459.1988.10478722
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Wiley. https://doi.org/10.1002/9781119013563
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52(3), 431–462. https://doi.org/10.1007/BF02294365
Nicholson, J. S., Deboeck, P. R., & Howard, W. (2017). Attrition in developmental psychology: A review of modern missing data reporting and practices. International Journal of Behavioral Development, 41(1), 143–153. https://doi.org/10.1177/0165025415618275
Park, T., & Lee, S.-Y. (1997). A test of missing completely at random for longitudinal data with missing observations. Statistics in Medicine, 16(16), 1859–1871. https://doi.org/10.1002/(SICI)1097-0258(19970830)16:16%3c1859::AID-SIM593%3e3.0.CO;2-3
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. http://r-project.org/
Raghunathan, T. E. (2004). What Do We Do with Missing Data? Some Options for Analysis of Incomplete Data. Annual Review of Public Health, 25(1), 99–117. https://doi.org/10.1146/annurev.publhealth.25.102802.124410
Raykov, T., & Marcoulides, G. A. (2014). Identifying Useful Auxiliary Variables for Incomplete Data Analyses. Educational and Psychological Measurement, 74(3), 537–550. https://doi.org/10.1177/0013164413511326
Raykov, T., & West, B. T. (2016). On enhancing plausibility of the missing at random assumption in incomplete data analyses via evaluation of response-auxiliary variable correlations. Structural Equation Modeling, 23(1), 45–53. https://doi.org/10.1080/10705511.2014.937848
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02
Rothacher, Y., & Strobl, C. (2023). Identifying informative predictor variables with random forests. Journal of Educational and Behavioral Statistics. Advance online publication. https://doi.org/10.3102/10769986231193327
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.2307/2335739
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
Savalei, V., & Bentler, P. M. (2009). A two-stage approach to missing data: Theory and application to auxiliary variables. Structural Equation Modeling, 16(3), 477–497. https://doi.org/10.1080/10705510903008238
Schafer, J. L. (1997). Analysis of incomplete multivariate data. Chapman & Hall.
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307. https://doi.org/10.1186/1471-2105-9-307
Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25. https://doi.org/10.1186/1471-2105-8-25
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323–348. https://doi.org/10.1037/a0016973
Tay, J. K., Narasimhan, B., & Hastie, T. (2023). Elastic Net Regularization Paths for All Generalized Linear Models. Journal of Statistical Software, 106(1), 1–31. https://doi.org/10.18637/jss.v106.i01
Thoemmes, F., & Rose, N. (2014). A Cautious Note on Auxiliary Variables That Can Increase Bias in Missing Data Problems. Multivariate Behavioral Research, 49(5), 443–459. https://doi.org/10.1080/00273171.2014.931799
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://www.jstor.org/stable/2346178
van Ginkel, J. R., Linting, M., Rippe, R. C. A., & van der Voort, A. (2020). Rebutting existing misconceptions about multiple imputation as a method for handling missing data. Journal of Personality Assessment, 102(3), 297–308. https://doi.org/10.1080/00223891.2018.1530680
Woods, A. D., Gerasimova, D., Van Dusen, B., Nissen, J., Bainter, S., Uzdavines, A., Davis-Kean, P., Halvorson, M. A., King, K., Logan, J., Xu, M., Vasilev, M. R., Clay, J. M., Moreau, D., Joyal-Desmarais, K., Cruz, R. A., Brown, D., Schmidt, K., & Elsherif, M. (2023). Best practices for addressing missing data through multiple imputation. PsyArXiv. https://doi.org/10.31234/osf.io/uaezh
Yuan, K.-H., Jamshidian, M., & Kano, Y. (2018). Missing Data Mechanisms and Homogeneity of Means and Variances-Covariances. Psychometrika, 83(2), 425–442. https://doi.org/10.1007/s11336-018-9609-x
Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psychological Methods, 22(4), 649–666. https://doi.org/10.1037/met0000104
Funding
No funding was used to support this research.
Ethics declarations
Conflicts of interest
The authors have no conflicts of interest to disclose.
Ethics approval, Informed consent, and Consent for publication
Not applicable for the simulated data used in the paper (no human subjects participated in this theoretical, simulation research).
Open practices statement
All simulation files and worked example code are available on an Open Science Framework Repository at: https://osf.io/q84ts/.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hayes, T., Baraldi, A.N. & Coxe, S. Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little’s MCAR test). Behav Res 56, 8608–8639 (2024). https://doi.org/10.3758/s13428-024-02494-1


