mixgb is a scalable multiple imputation framework based on XGBoost, bootstrapping and predictive mean matching. The proposed framework is implemented in an R package mixgb. We have shown that our framework obtains less biased estimates and reflects appropriate imputation variability, while achieving high computational efficiency. For more details, please check our paper https://arxiv.org/abs/2106.01574. (Our package has been revised and updated since the preprint was posted. Simulation code in the original supplementary files may not run as expected. Revised paper and adapted code will be updated soon.)

May 2022

• Visual diagnostic functions of multiply imputed data
• Use S3 instead of R6
• Plot functions can show masked missing data (if provided)

April 2022

• User can now set different number of iterations maxit.
• Both single and multiple imputation with XGBoost can do predictive mean matching
• Bootstrap data to make m imputations is optional. User can set bootstrap = FALSE to disable bootstrap. Users can also set sampling related hyperparameters of XGBoost (subsample, colsample_bytree, colsample_bylevel, colsample_bynode) to be less than 1 to achieve similar effect.
• Add predicted mean matching type 0. Now the options for pmm.type are NULL,0,1,2 or "auto" (type 2 for numeric/integer variables, no PMM for categorical variables).
• Compatible with data.table
• Cross-validation to pre-tune nrounds

## Under development

• make m imputation parallel
• pre-tune other hyperparameters

## Notice

• Currently users can set XGBoost parameter nthread for multithreading with OpenMP support. However, MacOS has disabled OpenMP support.

## 1. Installation

You can install the development version of mixgb from GitHub with:

# install.packages("devtools")
devtools::install_github("agnesdeng/mixgb")

### 1.1 Data cleaning before imputation

It is highly recommended to clean and check your data before imputation. Here are some common issues:

• Data should be a data frame.
• ID should be removed
• Missing values should be coded as NA not NaN
• Inf or -Inf are not allowed
• Empty cells should be coded as NA or sensible values
• Variables of “character” type should be converted to “factor” instead
• Variables of “factor” type should have at least two levels

The function data_clean() can do a preliminary check and fix some obvious problems. However, it would not fix all issues related to data quality.

cleanWithNA.df <- data_clean(rawdata)

## 2. Impute missing values with mixgb

We first load the mixgb package and the nhanes3_newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.

str(nhanes3_newborn)
#> tibble [2,107 × 16] (S3: tbl_df/tbl/data.frame)
#>  $HSHSIZER: int [1:2107] 4 3 5 4 4 3 5 3 3 3 ... #>$ HSAGEIR : int [1:2107] 2 5 10 10 8 3 10 7 2 7 ...
#>  $HSSEX : Factor w/ 2 levels "1","2": 2 1 2 2 1 1 2 2 2 1 ... #>$ DMARACER: Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 2 ...
#>  $DMAETHNR: Factor w/ 3 levels "1","2","3": 3 1 3 3 3 3 3 3 3 3 ... #>$ DMARETHN: Factor w/ 4 levels "1","2","3","4": 1 3 2 1 1 1 2 1 2 2 ...
#>  $BMPHEAD : num [1:2107] 39.3 45.4 43.9 45.8 44.9 42.2 45.8 NA 40.2 44.5 ... #> ..- attr(*, "label")= chr "Head circumference (cm)" #>$ BMPRECUM: num [1:2107] 59.5 69.2 69.8 73.8 69 61.7 74.8 NA 64.5 70.2 ...
#>   ..- attr(*, "label")= chr "Recumbent length (cm)"
#>  $BMPSB1 : num [1:2107] 8.2 13 6 8 8.2 9.4 5.2 NA 7 5.9 ... #> ..- attr(*, "label")= chr "First subscapular skinfold (mm)" #>$ BMPSB2  : num [1:2107] 8 13 5.6 10 7.8 8.4 5.2 NA 7 5.4 ...
#>   ..- attr(*, "label")= chr "Second subscapular skinfold (mm)"
#>  $BMPTR1 : num [1:2107] 9 15.6 7 16.4 9.8 9.6 5.8 NA 11 6.8 ... #> ..- attr(*, "label")= chr "First triceps skinfold (mm)" #>$ BMPTR2  : num [1:2107] 9.4 14 8.2 12 8.8 8.2 6.6 NA 10.9 7.6 ...
#>   ..- attr(*, "label")= chr "Second triceps skinfold (mm)"
#>  $BMPWT : num [1:2107] 6.35 9.45 7.15 10.7 9.35 7.15 8.35 NA 7.35 8.65 ... #> ..- attr(*, "label")= chr "Weight (kg)" #>$ DMPPIR  : num [1:2107] 3.186 1.269 0.416 2.063 1.464 ...
#>   ..- attr(*, "label")= chr "Poverty income ratio"
#>  $HFF1 : Factor w/ 2 levels "1","2": 2 2 1 1 1 2 2 1 2 1 ... #>$ HYD1    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 3 1 1 1 1 1 1 2 1 ...
colSums(is.na(nhanes3_newborn))
#> HSHSIZER  HSAGEIR    HSSEX DMARACER DMAETHNR DMARETHN  BMPHEAD BMPRECUM
#>        0        0        0        0        0        0      124      114
#>   BMPSB1   BMPSB2   BMPTR1   BMPTR2    BMPWT   DMPPIR     HFF1     HYD1
#>      161      169      124      167      117      192        7        0

To impute this dataset, we can use the default settings. The default number of imputed datasets m = 5. Note that we do not need to convert our data into dgCMatrix or one-hot coding format. Our package will convert it automatically. Variables should be of the following types: numeric, integer, factor or ordinal factor.

# use mixgb with default settings
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

### 2.1 Customise imputation settings

We can also customise imputation settings:

• The number of imputed datasets m

• The number of imputation iterations maxit

• Whether to convert ordinal factors to integer (imputation process will be faster) ordinalAsInteger

• Whether to use bootstrapping bootstrap

• Predictive mean matching settings pmm.type, pmm.k and pmm.link.

• Initial imputation methods for different types of variables initial.num, initial.int and initial.fac.

• Whether to save models for imputing newdata save.models and save.vars.

• XGBoost hyperparameters and verbose settings. xgb.params, nrounds, early_stopping_rounds, print_every_n and verbose.

# Use mixgb with chosen settings
params <- list(max_depth = 6, gamma = 0.1, eta = 0.3, min_child_weight = 1,
subsample = 1, colsample_bytree = 1, colsample_bylevel = 1,
colsample_bynode = 1, nthread = 4, tree_method = "auto",
gpu_id = 0, predictor = "auto")

imputed.data <- mixgb(data = nhanes3_newborn, m = 5, maxit = 1,
ordinalAsInteger = TRUE, bootstrap = TRUE, pmm.type = "auto",
pmm.k = 5, pmm.link = "prob", initial.num = "normal", initial.int = "mode",
initial.fac = "mode", save.models = FALSE, save.vars = NULL,
xgb.params = params, nrounds = 50, early_stopping_rounds = 10,
print_every_n = 10L, verbose = 0)

### 2.2 Tune hyperparameters

Imputation performance can be affected by the hyperparameter settings. It may seem daunting to tune a large set of hyperparameters, but often we can narrow down the search as many hyperparameters are correlated. In our package, we have a function mixgb_cv() to tune nrounds. There is no default nrounds value in XGBoost, so we need to specify it. The default nrounds in mixgb is 50. However, we recommend using mixgb_cv() to find the optimal nrounds first.

cv.results <- mixgb_cv(data = nhanes3_newborn, verbose = FALSE)
cv.results$response #> [1] "BMPSB2" cv.results$best.nrounds
#> [1] 15

By default, mixgb_cv() will randomly choose an incomplete variable as the response and build an XGBoost model with other variables using the complete cases of the dataset. Therefore, each run of mixgb_cv() is likely to return different results. Users can also specify the response and covariates in the argument response and select_features, respectively.

cv.results <- mixgb_cv(data = nhanes3_newborn, nfold = 10, nrounds = 100,
early_stopping_rounds = 1, response = "BMPHEAD", select_features = c("HSAGEIR",
"HSSEX", "DMARETHN", "BMPRECUM", "BMPSB1", "BMPSB2",
"BMPTR1", "BMPTR2", "BMPWT"), verbose = FALSE)

train.imputed <- mixgb.obj$imputed.data # the 5th imputed dataset head(train.imputed[[5]]) #> HSHSIZER HSAGEIR HSSEX DMARACER DMAETHNR DMARETHN BMPHEAD BMPRECUM BMPSB1 #> 1: 7 2 1 1 1 3 43.0 67.1 9.2 #> 2: 4 3 2 2 3 2 42.6 67.1 8.8 #> 3: 3 9 2 2 3 2 46.5 64.3 8.6 #> 4: 3 9 2 1 3 1 46.2 68.5 10.8 #> 5: 5 4 1 1 3 1 44.7 63.0 6.0 #> 6: 5 10 1 1 3 1 45.2 72.0 5.4 #> BMPSB2 BMPTR1 BMPTR2 BMPWT DMPPIR HFF1 HYD1 #> 1: 8.5 8.8 8.8 7.80 1.701 2 1 #> 2: 8.8 13.3 12.2 8.70 0.102 2 1 #> 3: 8.0 10.4 9.2 8.00 0.359 1 3 #> 4: 10.0 16.6 16.0 8.98 0.561 1 3 #> 5: 5.8 9.0 9.0 7.60 2.379 2 1 #> 6: 5.4 9.2 9.4 9.00 2.173 2 2 To impute new data with this saved imputer object, we use the impute_new() function. User can also specify whether to use new data for initial imputation. By default, initial.newdata = FALSE, we will use the information of training data to initially impute the new data. New data will be imputed with the saved models. This process will be considerably faster as we don’t need to build the imputation models again. test.imputed <- impute_new(object = mixgb.obj, newdata = test.data) If PMM is used when we call mixgb(), predicted values of missing entries in the new dataset are matched with donors from training data. Users can also set the number of donors for PMM when imputing new data. By default, pmm.k = NULL , which means the same setting as the training object will be used. Similarly, users can set the number of imputed datasets m. Note that this value has to be smaller than or equal to the m in mixgb(). If it is not specified, it will use the same m value as the saved object. test.imputed <- impute_new(object = mixgb.obj, newdata = test.data, initial.newdata = FALSE, pmm.k = 3, m = 4) ## 5. Install mixgb with GPU support Multiple imputation can be run with GPU support for machines with NVIDIA GPUs. Note that users have to install the R package xgboost with GPU support first. The xgboost R package pre-built binary on Linux x86_64 with GPU support can be downloaded from the release page https://github.com/dmlc/xgboost/releases/tag/v1.4.0 The package can then be installed by running the following commands: # Install dependencies$ R -q -e "install.packages(c('data.table', 'jsonlite'))"

# Install XGBoost
\$ R CMD INSTALL ./xgboost_r_gpu_linux.tar.gz

Then users can install package mixgb in R.

devtools::install_github("agnesdeng/mixgb")

Users just need to specify tree_method = "gpu_list" in the params list which will then be passed to xgb.params in mixgb(). Other GPU-realted arguments include gpu_id and predictor. By default, gpu_id = 0 and predictor = "auto".

params <- list(max_depth = 6, gamma = 0.1, eta = 0.3, min_child_weight = 1,
subsample = 1, colsample_bytree = 1, colsample_bylevel = 1,
colsample_bynode = 1, nthread = 4, tree_method = "gpu_list",
gpu_id = 0, predictor = "auto")

mixgb.data <- mixgb(data = withNA.df, m = 5, xgb.params = params)