Multiple Imputation Through XGBoost • mixgb

The R package mixgb provides a scalable approach for multiple imputation by leveraging XGBoost, subsampling and predictive mean matching. We have shown that our method can yield less biased estimates and reflect appropriate imputation variability, while achieving high computational efficiency. For further details, see our paper Multiple Imputation Through XGBoost.

New updates

New Development Version v2.2.3 on GitHub - Jan 2026

Support saving intermediate imputation summary statistics so that they can be passed to the vismi package for convergence diagnostics.

New Release on CRAN - Dec 2025

New CRAN version 2.0.3. Now compatible with XGBoost (3.1.2.1) on CRAN.
This update addresses breaking changes introduced in the latest XGBoost release on CRAN in December 2025. If you experience any problems, please open an issue on GitHub.

1. Installation

You can install the development version of mixgb from GitHub with:

# install.packages("devtools")
devtools::install_github("agnesdeng/mixgb")

library(mixgb)

2. Sanity check input data before imputation

Common issues

Please clean and check your data before imputation. Here are some common issues:

Data should be a data frame.
ID columns should be removed.
Missing values should be coded as NA, not NaN.
Inf and -Inf are not allowed.
Empty cells should be coded as NA or sensible values.
Variables of “character” type should be converted to “factor”.
Variables of “factor” type should have at least two levels.
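
As a rough illustration, most of the issues listed above can be spotted with base R before any imputation is attempted. The helper `report_issues()` and the toy data frame below are hypothetical and are not part of mixgb; they only sketch one way to count problem values per column.

```r
# Hypothetical helper (not part of mixgb): summarise common data issues per column.
report_issues <- function(df) {
  stopifnot(is.data.frame(df))
  data.frame(
    variable = names(df),
    # Number of NaN values in numeric columns
    n_nan = vapply(df, function(x) if (is.numeric(x)) sum(is.nan(x)) else 0L, integer(1)),
    # Number of Inf / -Inf values in numeric columns
    n_inf = vapply(df, function(x) if (is.numeric(x)) sum(is.infinite(x)) else 0L, integer(1)),
    # Number of empty strings in character or factor columns
    n_empty = vapply(df, function(x) {
      if (is.character(x) || is.factor(x)) sum(as.character(x) == "", na.rm = TRUE) else 0L
    }, integer(1)),
    # Character columns still need converting to factor
    is_character = vapply(df, is.character, logical(1)),
    # Factor columns with fewer than two levels
    single_level_factor = vapply(df, function(x) is.factor(x) && nlevels(x) < 2, logical(1)),
    row.names = NULL
  )
}

# Toy data containing one example of each issue (made up for illustration)
toy <- data.frame(
  id = c("a1", "a2", "a3"),
  weight_kg = c(3.2, NaN, Inf),
  sex = factor(c("F", "F", "F")),
  note = c("", "ok", NA),
  stringsAsFactors = FALSE
)

issues <- report_issues(toy)
print(issues)
```

A report like this only flags candidates; whether a column is an ID or a single-level factor is worth keeping remains a judgement call.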

Using `check_data()` for preliminary data check

The function check_data() performs a preliminary check and attempts to fix some obvious issues.

Step 1: data input needs to be a data.frame, tibble or data.table
Step 2: convert character variables to factor type if stringAsFactors = TRUE
Step 3: convert NaN, "NaN" to NA
Step 4: convert Inf, -Inf, "Inf", "-Inf" to NA
Step 5: convert empty strings "" to NA
Step 6: check factor variables with only a single level and ask user whether to remove them
Step 7: check factor variables with too many levels (more than max_levels) and ask user whether to keep them
Return: preliminarily checked data, with missing values encoded as NA.

Please note that this function serves as a reminder of how important data cleaning is and it almost surely cannot address all data quality issues :)
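
To make Steps 2 to 5 more concrete, here is a minimal base R sketch of the same kind of recoding. It is not the implementation used by check_data(), whose internals may differ; the function name `recode_missing()` and the toy data are hypothetical, and existing factor columns are left untouched for simplicity.

```r
# Hypothetical sketch of Steps 2-5 in base R; check_data() itself may differ.
recode_missing <- function(df) {
  stopifnot(is.data.frame(df))
  df[] <- lapply(df, function(x) {
    if (is.character(x)) {
      # Steps 3-5: string forms of NaN/Inf and empty strings become NA
      x[x %in% c("NaN", "Inf", "-Inf", "")] <- NA
      # Step 2: convert character to factor (NA is not kept as a level)
      x <- factor(x)
    } else if (is.numeric(x)) {
      # Steps 3-4: is.finite() is FALSE for NaN, Inf and -Inf (and for NA)
      x[!is.finite(x)] <- NA
    }
    x
  })
  df
}

# Toy data with string and numeric forms of the problem values
toy2 <- data.frame(
  ethnicity = c("A", "NaN", "", "B"),
  weight_kg = c(3.1, NaN, Inf, -Inf),
  stringsAsFactors = FALSE
)

cleaned <- recode_missing(toy2)
str(cleaned)
```

The interactive parts of check_data() (Steps 6 and 7) are deliberately left out of this sketch.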

bad_data <- newborn
bad_data[, "ethnicity"] <- as.character(bad_data[, "ethnicity",
    drop = TRUE])
bad_data[2, "ethnicity"] <- "NaN"
bad_data[, "age_months"] <- as.factor(bad_data[, "age_months",
    drop = TRUE])
bad_data[, 1] <- ""
bad_data[, "sex"] <- "Unknown"
bad_data[4, "weight_kg"] <- NaN
bad_data[5, "weight_kg"] <- Inf

checked_data <- check_data(data = bad_data)

3. Impute missing values with `mixgb`

Please read: https://agnesdeng.github.io/mixgb/articles/Using-mixgb.html

4. Impute new unseen data using a saved imputer object

Please read: https://agnesdeng.github.io/mixgb/articles/Imputing-newdata.html

5. Visualisation Diagnostics for Multiple Imputation

It is crucial to assess the plausibility of imputations before doing any analysis.

We have a standalone R package vismi (Visualisation Diagnostics for Multiple Imputation) which provides a comprehensive suite of diagnostics for assessing the quality of multiply imputed data. The package supports imputed data generated by various multiple imputation methods, including mixgb, mice, and more.

For more details, please check: https://agnesdeng.github.io/vismi/

6. Install `mixgb` with GPU support

Multiple imputation can be run with GPU support for machines with NVIDIA GPUs. Users must first install the R package xgboost with GPU support.

6.1 Newest Version

XGBoost >= 3.1.2.1, mixgb >= 2.0.3

XGBoost has recently introduced a breaking change. The current mixgb release on CRAN (≥ 2.0.3) is compatible with XGBoost 3.1.2.1. GPU support has not yet been fully tested with this update. I’ll test GPU support with this update as soon as I have time.
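
Because the pairing of package versions matters here, it may help to check what is installed before imputing. The sketch below uses only base R version utilities; the helper name `meets_minimum()` is hypothetical, and the minimum versions are simply the ones quoted in this section.

```r
# Hypothetical helper: TRUE when an installed version meets a minimum version.
meets_minimum <- function(installed, minimum) {
  as.package_version(installed) >= as.package_version(minimum)
}

# Plain version strings, to show how the comparison behaves
same_version <- meets_minimum("3.1.2.1", "3.1.2.1")
newer_version <- meets_minimum("2.0.3", "1.3.1")
older_version <- meets_minimum("1.7.5", "2.0.0")

# With real installations, compare the output of utils::packageVersion().
# The block is skipped when either package is not installed.
if (requireNamespace("xgboost", quietly = TRUE) &&
    requireNamespace("mixgb", quietly = TRUE)) {
  ok <- meets_minimum(utils::packageVersion("xgboost"), "3.1.2.1") &&
    meets_minimum(utils::packageVersion("mixgb"), "2.0.3")
  message("Version pairing from this section satisfied: ", ok)
}
```

Comparing parsed version objects rather than raw strings avoids mistakes such as "10.0" sorting before "9.0".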

6.2 Older Version

Please refer to the instructions for the older version below if you want to use GPU support with mixgb.

XGBoost >= 2.0.0, mixgb >= 1.3.1

Please download the newest version of XGBoost with GPU support via XGBoost GitHub Releases.

# Change the file path where you saved the downloaded
# XGBoost package
install.packages("path_to_downloaded_file/xgboost_r_gpu_win64_2.0.0.tar.gz",
    repos = NULL)

Then users can install the newest version of our package mixgb in R.

devtools::install_github("agnesdeng/mixgb")
library(mixgb)

To utilise the GPU version of mixgb(), users can simply specify device = "cuda" in the params list, which will then be passed to the xgb.params argument in the function mixgb(). Note that tree_method = "hist" is the default from XGBoost 2.0.0 onwards.

params <- list(device = "cuda", subsample = 0.7, nthread = 1,
    tree_method = "hist")

mixgb.data <- mixgb(data = withNA.df, m = 5, xgb.params = params)

XGBoost < 2.0.0, mixgb < 1.3.1

The xgboost R package pre-built binary on Linux x86_64 with GPU support can be downloaded from the release page https://github.com/dmlc/xgboost/releases/tag/v1.4.0

The package can then be installed by running the following commands:

# Install dependencies
$ R -q -e "install.packages(c('data.table', 'jsonlite'))"

# Install XGBoost
$ R CMD INSTALL ./xgboost_r_gpu_linux.tar.gz

Then users can install package mixgb in R.

devtools::install_github("agnesdeng/mixgb")
library(mixgb)

To utilise the GPU version of mixgb(), users can simply specify tree_method = "gpu_hist" in the params list which will then be passed to the xgb.params argument in the function mixgb(). Other adjustable GPU-related arguments include gpu_id and predictor. By default, gpu_id = 0 and predictor = "auto".

params <- list(max_depth = 3, subsample = 0.7, nthread = 1, tree_method = "gpu_hist",
    gpu_id = 0, predictor = "auto")

mixgb.data <- mixgb(data = withNA.df, m = 5, xgb.params = params)

Notice

For multithreading, users can set the XGBoost nthread parameter with OpenMP support. Note that OpenMP support is currently disabled on macOS.
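
One possible way to choose a value for nthread is to derive it from the number of available cores. This is only a sketch: leaving one core free is an assumption made for illustration, not a recommendation from the package, and the other parameter values are copied from the examples above.

```r
# Pick a thread count, leaving one core free; detectCores() may return NA.
cores <- parallel::detectCores()
n_threads <- if (is.na(cores)) 1L else max(1L, cores - 1L)

# Parameter list in the same form as the earlier examples; values are illustrative.
params <- list(subsample = 0.7, nthread = n_threads, tree_method = "hist")
str(params)
```

The list can then be supplied through the xgb.params argument of mixgb(), as in the GPU examples. Where OpenMP is unavailable, a larger nthread is not expected to speed anything up.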
