Skip to content

Native support for parallel imputation#701

Open
stefvanbuuren wants to merge 8 commits intomasterfrom
future.apply
Open

Native support for parallel imputation#701
stefvanbuuren wants to merge 8 commits intomasterfrom
future.apply

Conversation

@stefvanbuuren
Copy link
Member

@stefvanbuuren stefvanbuuren commented Apr 6, 2025

Experimental feature

This PR adds native support for parellel imputation to mice().

  • The mice() function now supports parallel execution of imputations via the new parallel = TRUE argument. When enabled, instead of sequentially calculating m imputations at a given iteration, the m chains are distributed across available CPU cores using the future and future.apply frameworks.

  • Parallel imputation may significantly reduce runtime, especially for large datasets and many imputations (m), but does not pay-off for small datasets or few imputations.

  • Parallel execution is implemented only in the mice() function, and does not affect the mice.impute.*() functions.

  • To activate parallel execution:

library(mice)
imp <- mice(data, parallel = TRUE)
  • The default is parallel = FALSE for backward compatibility.
  • The argument n.core specifies the number of CPU cores to use. If n.core is not specified (default) the actual number of cores used is calculated as minimum(number of available cores - 1, number of imputations).
  • printFlag = TRUE prints iteration and imputation number only in sequential mode; parallel mode reports timing per iteration.
  • Note: mice() will automatically select a parallel backend (default is multisession). To override, users may manually call plan(...) before running mice().
  • The future and future.apply packages must be installed to run parallel imputation. If not installed, mice() will throw an error and suggest installing the packages.
  • The wrappers parlmice() and futuremice() are still functional, but now throw a warning that they will be deprecated in the future. Users are encouraged to use the new parallel argument in mice() instead.

Notes and questions:

  • Not yet finalized, but far enough to play around and test. As far as I can see, everything functions properly, but I did not perform any specific tests for parallelization, time gains, and so on.
  • I implemented only the minimal parallel and n.core argument for mice(). To keep the mice API clean, I suggest setting alternative future plans outside mice(). Does that sound like a good idea?
  • seeds are tricky for parallel environments. It is not yet clear whether we can borrow the standard seed argument from mice() to get reproducible solutions. Do we need to add a few tests for this?
  • Should we create a vignette, similar to https://www.gerkovink.com/parlMICE/Vignette_parlMICE.html and https://www.gerkovink.com/miceVignettes/futuremice/Vignette_futuremice.html to help users to get started?
  • Do we miss any arguments from those vignettes that we should support in mice()?
  • In parallel mode, isTRUE(printFlag) generates the time per iteration. Is that appropriate, or better report something else?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 24 changed files in this pull request and generated no comments.

Files not reviewed (18)
  • DESCRIPTION: Language not supported
  • NAMESPACE: Language not supported
  • R/cbind.R: Language not supported
  • R/complete.R: Language not supported
  • R/edit.setup.R: Language not supported
  • R/futuremice.R: Language not supported
  • R/initialize.imp.R: Language not supported
  • R/internal.R: Language not supported
  • R/mice.R: Language not supported
  • R/parlmice.R: Language not supported
  • R/sampler.R: Language not supported
  • R/sampler.univ.R: Language not supported
  • man/futuremice.Rd: Language not supported
  • man/mice.Rd: Language not supported
  • man/parlmice.Rd: Language not supported
  • man/record.event.Rd: Language not supported
  • tests/testthat/test-as.mids.R: Language not supported
  • tests/testthat/test-mice.impute.norm.R: Language not supported
Comments suppressed due to low confidence (2)

_pkgdown.yml:131

  • Verify that the new 'record.event' reference corresponds to an existing documentation page to avoid broken links.
  - record.event

_pkgdown.yml:184

  • Ensure that the 'developer-notes-complete' page is properly linked and exists in the documentation to support developer guidance.
  - developer-notes-complete

@stefvanbuuren
Copy link
Member Author

Well, thanks Copilot for your informative review.

@stefvanbuuren stefvanbuuren changed the base branch from master to dev April 9, 2025 15:26
@stefvanbuuren stefvanbuuren changed the base branch from dev to master April 9, 2025 15:30
@stefvanbuuren
Copy link
Member Author

stefvanbuuren commented May 1, 2025

We also need support for custom imputation methods (cf #550).

Merge branch 'master' into future.apply

# Conflicts:
#	DESCRIPTION
#	NEWS.md
#	R/edit.setup.R
#	R/mice.R
#	R/sampler.R
#	_pkgdown.yml
@stefvanbuuren
Copy link
Member Author

See alexanderrobitzsch/miceadds#30 for an example with miceadds integration.

@stefvanbuuren
Copy link
Member Author

Development note

  • The future.apply branch is merged into de dev branch. Any further development are made there.
  • For better integration and backward compatibility, the dev branch no longer supports the new record.event() function and the logenv environment. Logging of the imputation process relies - as before - on the loggedEvents object and the updateLog() function.


# Set up parallel backend if requested
if (parallel) {
if (!requireNamespace("future.apply", quietly = TRUE)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could consider making future.apply a hard dependency. Then, if parallel = FALSE, you just specify a sequential backend, and if parallel = TRUE. The advantage of this would be that the code is cleaner and easier to maintain (there is just one loop instead of the if-else later on). The disadvantage is that a future::apply() with sequential backend may use the rng state differently, and may thus not reproduce the original results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting idea that simplifies the main loop in sampler() at the price of an additional dependency. I would favor it if the algorithm is exactly reproducible. It needs not be backward compatible (impossible to maintain over versions).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be reproducible at the same machine, but perhaps not across platforms. We can test this, I suppose. I think, but am not entirely certain, that mice will then also be reproducible regardless of whether the replicator uses parallel = TRUE. I can dive into this.

R/sampler.R Outdated
}
log_i <- if (exists("loggedEvents", inherits = FALSE)) get("loggedEvents", inherits = FALSE) else NULL
list(imp = imp_i, mean = mean_i, var = var_i, log = log_i)
}, future.seed = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the dotdotdot here when closing the future_lapply() call, then users can specify future.packages, future.globals and eventually all other functionality that is available. This allows users to specify their own imputation functions, for example.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can support a user-specified list future.globals as mice dots argument. It should be strainghtforward to adapt the code.

log_i <- if (exists("loggedEvents", inherits = FALSE)) get("loggedEvents", inherits = FALSE) else NULL
list(imp = imp_i, mean = mean_i, var = var_i, log = log_i)
},
future.packages = future.packages,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Stef, I think the current code does not allow users to have their self-written custom imputation functions as imputation function, unless they save these within a custom package. Perhaps you can allow users to specify a list with objects (functions/variables) for future.globals, and append these two functions to it (if non-null). Alternatively, if these run_imputation_cycle and update_chain_stats functions are properly added to the mice package, calling these through globals is not necessary, as the futures should have access to the entire mice package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that require current run_imputation_cycle() and update_chain_stats() to be exported functions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. That would imply that many unexported functions that are called under the hood should be exported explicitly. It's more likely that all functions in a package are available in downstream futures.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this mostly concerns that functions specified in the $meth slot may not be available for the future workers, would it just make sense to scan $meth, see if the functions only exist in .GlobalEnv and throw a warning to the user?

@stefvanbuuren
Copy link
Member Author

To install the experimental version use

remotes::install_github("amices/mice@refs/pull/701/head")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants