The R package mixgb provides a scalable approach for multiple imputation by leveraging XGBoost, subsampling and predictive mean matching. We have shown that our method can yield less biased estimates and reflect appropriate imputation variability, while achieving high computational efficiency. For further details, see our paper Multiple Imputation Through XGBoost.
New updates
New Development Version v2.2.3 on GitHub - Jan 2026
- Support saving intermediate imputation summary statistics so that they can be passed to the vismi package for convergence diagnostics.
New Release on CRAN - Dec 2025
New CRAN version 2.0.3. Now compatible with XGBoost (3.1.2.1) on CRAN.
This update addresses breaking changes introduced in the latest XGBoost release on CRAN in December 2025. If you experience any problems, please open an issue on GitHub.
1. Installation
You can install the development version of mixgb from GitHub with:
# install.packages("devtools")
devtools::install_github("agnesdeng/mixgb")
2. Sanity check input data before imputation
Common issues
Please clean and check your data before imputation. Here are some common issues:
- Data should be a data frame.
- ID should be removed
- Missing values should be coded as NA, not NaN
- Inf or -Inf are not allowed
- Empty cells should be coded as NA or sensible values
- Variables of “character” type should be converted to “factor”
- Variables of “factor” type should have at least two levels
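The issues listed above can also be handled by hand in base R. The following is a minimal sketch under stated assumptions: the toy data frame, its column names and its values are made up for illustration and are not part of mixgb.

```r
# Toy data frame exhibiting several of the issues listed above
# (all column names and values are invented for illustration).
toy <- data.frame(
  id     = 1:5,
  group  = c("a", "b", "", "a", "b"),
  height = c(1.2, NaN, Inf, 0.9, -Inf),
  site   = rep("only_one", 5),
  stringsAsFactors = FALSE
)

# Remove the ID column.
toy$id <- NULL

# Recode NaN, Inf and -Inf in numeric columns as NA.
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], function(x) {
  x[!is.finite(x)] <- NA
  x
})

# Recode empty strings as NA, then convert character columns to factors.
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], function(x) {
  x[!is.na(x) & x == ""] <- NA
  factor(x)
})

# Flag factor columns with fewer than two levels for manual review.
fct_cols <- vapply(toy, is.factor, logical(1))
single_level <- names(toy)[fct_cols & vapply(toy, nlevels, integer(1)) < 2]
```

Here single_level collects the names of factor columns that would need to be dropped or recoded before imputation.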
Using check_data() for preliminary data check
The function check_data() performs a preliminary check and attempts to fix some obvious issues.
Step 1: data input needs to be a data.frame, tibble or data.table
Step 2: convert character variables to factor type if stringAsFactors = TRUE
Step 3: convert NaN, "NaN" to NA
Step 4: convert Inf, -Inf, "Inf", "-Inf" to NA
Step 5: convert empty strings "" to NA
Step 6: check factor variables with only single level and ask user whether to remove them
Step 7: check factor variables with too many levels (more than max_levels) and ask user whether to keep them
Return: preliminarily checked data, with missing values encoded as NA.
Please note that this function serves as a reminder of how important data cleaning is and it almost surely cannot address all data quality issues :)
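library(mixgb)
# Start from the newborn dataset and deliberately introduce problems
bad_data <- newborn
# Character column containing the string "NaN"
bad_data[, "ethnicity"] <- as.character(bad_data[, "ethnicity",
    drop = TRUE])
bad_data[2, "ethnicity"] <- "NaN"
# Factor with many levels
bad_data[, "age_months"] <- as.factor(bad_data[, "age_months",
    drop = TRUE])
# Empty strings in the first column
bad_data[, 1] <- ""
# Variable with a single value
bad_data[, "sex"] <- "Unknown"
# NaN and Inf in a numeric column
bad_data[4, "weight_kg"] <- NaN
bad_data[5, "weight_kg"] <- Inf
checked_data <- check_data(data = bad_data)
3. Impute missing values with mixgb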
Please read: https://agnesdeng.github.io/mixgb/articles/Using-mixgb.html
4. Impute new unseen data using a saved imputer object
Please read: https://agnesdeng.github.io/mixgb/articles/Imputing-newdata.html
5. Visualisation Diagnostics for Multiple Imputation
It is crucial to assess the plausibility of imputations before doing any analysis.
We have a standalone R package vismi (Visualisation Diagnostics for Multiple Imputation) which provides a comprehensive suite of diagnostics for assessing the quality of multiply imputed data. The package supports imputed data generated by various multiple imputation methods, including mixgb, mice, and more.
For more details, please check: https://agnesdeng.github.io/vismi/
6. Install mixgb with GPU support
Multiple imputation can be run with GPU support for machines with NVIDIA GPUs. Users must first install the R package xgboost with GPU support.
6.2 Older Version
Please refer to the instructions for the older version below if you want to use GPU support with mixgb.
XGBoost >= 2.0.0, mixgb >= 1.3.1
Please download the newest version of XGBoost with GPU support via XGBoost GitHub Releases.
# Change the file path where you saved the downloaded
# XGBoost package
install.packages("path_to_downloaded_file/xgboost_r_gpu_win64_2.0.0.tar.gz",
    repos = NULL)
Then users can install the newest version of our package mixgb in R.
devtools::install_github("agnesdeng/mixgb")
library(mixgb)
To utilise the GPU version of mixgb(), users can simply specify device = "cuda" in the params list, which will then be passed to the xgb.params argument in the function mixgb(). Note that tree_method = "hist" is the default from XGBoost 2.0.0 onwards.
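As a minimal sketch of the setup described above, the GPU option could be supplied as follows. The helper name impute_on_gpu and the choice of m = 5 are illustrative assumptions, not part of mixgb. Calling the helper requires an NVIDIA GPU and a GPU-enabled build of xgboost.

```r
# Parameters forwarded to XGBoost; device = "cuda" requests GPU training.
gpu_params <- list(device = "cuda")

# Hypothetical helper (not part of mixgb): impute `data` m times,
# passing the GPU settings through the xgb.params argument of mixgb().
impute_on_gpu <- function(data, m = 5) {
  mixgb::mixgb(data = data, m = m, xgb.params = gpu_params)
}
```

On a suitable machine, a call such as impute_on_gpu(newborn) should then run the imputation on the GPU.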
XGBoost < 2.0.0, mixgb < 1.3.1
The xgboost R package pre-built binary on Linux x86_64 with GPU support can be downloaded from the release page https://github.com/dmlc/xgboost/releases/tag/v1.4.0
The package can then be installed by running the following commands:
# Install dependencies
$ R -q -e "install.packages(c('data.table', 'jsonlite'))"
# Install XGBoost
$ R CMD INSTALL ./xgboost_r_gpu_linux.tar.gz
Then users can install package mixgb in R.
devtools::install_github("agnesdeng/mixgb")
library(mixgb)
To utilise the GPU version of mixgb(), users can simply specify tree_method = "gpu_hist" in the params list, which will then be passed to the xgb.params argument in the function mixgb(). Other adjustable GPU-related arguments include gpu_id and predictor. By default, gpu_id = 0 and predictor = "auto".
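For these older versions, the parameter list might look like the sketch below. The helper name impute_on_gpu_legacy and m = 5 are illustrative assumptions, and gpu_id and predictor are written out at the default values stated above only to show where they go.

```r
# GPU settings for XGBoost < 2.0.0, using the values named in the text.
legacy_gpu_params <- list(
  tree_method = "gpu_hist",
  gpu_id      = 0,
  predictor   = "auto"
)

# Hypothetical helper (not part of mixgb): impute `data` m times,
# passing the legacy GPU settings through xgb.params.
impute_on_gpu_legacy <- function(data, m = 5) {
  mixgb::mixgb(data = data, m = m, xgb.params = legacy_gpu_params)
}
```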
