Multiple imputation through XGBoost

Obtain multiply imputed datasets using XGBoost, with an option to save models for imputing new data later on. Users can choose different settings regarding bootstrapping and predictive mean matching as well as XGBoost hyperparameters.

Usage

mixgb(
  data,
  m = 5,
  maxit = 1,
  ordinalAsInteger = FALSE,
  bootstrap = FALSE,
  pmm.type = "auto",
  pmm.k = 5,
  pmm.link = "prob",
  initial.num = "normal",
  initial.int = "mode",
  initial.fac = "mode",
  save.models = FALSE,
  save.vars = NULL,
  verbose = FALSE,
  xgb.params = list(max_depth = 3, gamma = 0, eta = 0.3, min_child_weight = 1,
    subsample = 0.7, colsample_bytree = 1, colsample_bylevel = 1, colsample_bynode = 1,
    tree_method = "auto", gpu_id = 0, predictor = "auto"),
  nrounds = 100,
  early_stopping_rounds = 10,
  print_every_n = 10L,
  xgboost_verbose = 0,
  ...
)

Arguments

data

A data.frame or data.table with missing values

m

The number of imputed datasets. Default: 5

maxit

The number of imputation iterations. Default: 1

ordinalAsInteger

Whether to convert ordinal factors to integers. By default, ordinalAsInteger = FALSE. Setting ordinalAsInteger = TRUE may speed up the imputation process for large datasets.

bootstrap

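Whether to use bootstrapping for multiple imputation. By default, bootstrap = FALSE. Setting bootstrap = TRUE can improve imputation variability when the sampling-related hyperparameters of XGBoost (subsample and the colsample_* parameters) are all set to 1, which is XGBoost's own default. Note that the default xgb.params shown in Usage sets subsample = 0.7.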
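As a rough illustration of what bootstrapping contributes, the sketch below draws a separate resample of rows for each of the m imputations, so that each set of models would be fitted to slightly different data. It uses base R only; the toy data frame and the object names are made up for this example, and mixgb's internal implementation may differ.

```r
# Illustrative sketch only: not mixgb's internal code.
set.seed(1)
toy <- data.frame(
  x = c(1.2, NA, 3.1, 4.8, 5.0, NA),
  y = c(10, 12, NA, 19, 21, 25)
)
m <- 3  # number of imputed datasets

# Draw one bootstrap sample of row indices per imputation.
boot.idx <- lapply(seq_len(m), function(i) {
  sample(nrow(toy), size = nrow(toy), replace = TRUE)
})

# Each imputation would fit its models on a different resampled dataset.
boot.data <- lapply(boot.idx, function(idx) toy[idx, , drop = FALSE])
```

Because the resamples differ, the fitted models differ as well, which is the source of between-imputation variability when XGBoost itself does no row or column subsampling.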

pmm.type

The type of predictive mean matching (PMM). Possible values:

NULL: Imputations without PMM;
0: Imputations with PMM type 0;
1: Imputations with PMM type 1;
2: Imputations with PMM type 2;
"auto" (Default): Imputations with PMM type 2 for numeric/integer variables; imputations without PMM for categorical variables.

pmm.k

The number of donors for predictive mean matching. Default: 5
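The donor-matching step that pmm.k controls can be sketched in base R as follows. The helper pmm_donor() is hypothetical and not part of mixgb; it shows only the generic matching idea (find the k observed cases with the closest predictions, then copy the observed value of one of them) and does not distinguish between PMM types 0, 1 and 2.

```r
# Illustrative sketch of donor matching; not mixgb's internal code.
pmm_donor <- function(yhat.obs, yhat.mis, y.obs, k = 5) {
  vapply(yhat.mis, function(p) {
    # Indices of the k observed cases whose predictions are closest to p.
    donors <- order(abs(yhat.obs - p))[seq_len(min(k, length(y.obs)))]
    # Impute with the observed value of one randomly chosen donor.
    y.obs[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
}

set.seed(1)
y.obs    <- c(2.0, 3.5, 4.1, 5.7, 6.3, 8.0)  # observed values
yhat.obs <- c(2.2, 3.4, 4.0, 5.5, 6.5, 7.8)  # predictions for observed cases
yhat.mis <- c(3.9, 7.0)                      # predictions for missing cases

imputed <- pmm_donor(yhat.obs, yhat.mis, y.obs, k = 3)
```

Every imputed value is an actually observed value, which is why PMM keeps imputations within the range and granularity of the data. A larger k widens the donor pool and adds variability; a smaller k keeps imputations closer to the prediction.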

pmm.link

The link used for predictive mean matching of binary variables:

"prob" (Default): use probabilities;
"logit": use logit values.
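The choice of link changes the scale on which "closeness" between predictions is measured. The base R snippet below, using made-up probabilities, shows that two predictions near 0 are close on the probability scale but far apart on the logit scale. It illustrates the two scales only, not mixgb's internal matching code.

```r
p <- c(0.01, 0.10, 0.50, 0.90, 0.99)  # example predicted probabilities
logit <- qlogis(p)                    # log(p / (1 - p))

# Distance between the two smallest predictions on each scale
d.prob  <- abs(p[1] - p[2])           # small on the probability scale
d.logit <- abs(logit[1] - logit[2])   # much larger on the logit scale
```

With pmm.link = "logit", donors for extreme predictions are therefore separated more sharply than with pmm.link = "prob".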

initial.num

Initial imputation method for numeric type data:

"normal" (Default);
"mean";
"median";
"mode";
"sample".

initial.int

Initial imputation method for integer type data:

"mode" (Default);
"sample".

initial.fac

Initial imputation method for factor type data:

"mode" (Default);
"sample".
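The simpler initial imputation methods can be sketched in base R as below. The helpers stat_mode() and initial_fill() are hypothetical names introduced for this illustration and are not mixgb functions; the "normal" method is omitted because its exact definition is not given here.

```r
# Illustrative helpers; not part of mixgb.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  # Most frequent value; ties resolve to the first one encountered.
  ux[which.max(tabulate(match(x, ux)))]
}

initial_fill <- function(x, method = c("mean", "median", "mode", "sample")) {
  method <- match.arg(method)
  miss <- is.na(x)
  obs <- x[!miss]
  fill <- switch(method,
    mean   = mean(obs),
    median = median(obs),
    mode   = stat_mode(obs),
    # Draw replacements at random from the observed values.
    sample = obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  )
  x[miss] <- fill
  x
}

x <- c(1, NA, 3, 3, NA, 8)
initial_fill(x, "mean")    # fills NAs with 3.75
initial_fill(x, "median")  # fills NAs with 3
initial_fill(x, "mode")    # fills NAs with 3
```

These placeholder values only give the first iteration a complete dataset to train on; they are subsequently replaced by model-based imputations.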

save.models

Whether to save models for imputing new data later on. Default: FALSE

save.vars

Response models for the variables specified in save.vars will be saved for imputing new data. Can be a vector of names or indices. By default, save.vars = NULL, in which case response models are saved only for variables with missing values. To save all models, specify save.vars = colnames(data).

verbose

Verbose setting for mixgb. If TRUE, the progress of imputation is printed. Default: FALSE

xgb.params

A list of XGBoost parameters. For more details, please check XGBoost documentation on parameters.
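One way to change a few hyperparameters while keeping the rest is to start from the default list shown in Usage and override selected entries with modifyList(). The sketch below assumes the mixgb package and its nhanes3 dataset are available, and it is skipped otherwise; the particular values of max_depth, eta and nrounds are arbitrary choices for illustration, not recommendations.

```r
# Start from the defaults listed in Usage.
params <- list(
  max_depth = 3, gamma = 0, eta = 0.3, min_child_weight = 1,
  subsample = 0.7, colsample_bytree = 1, colsample_bylevel = 1,
  colsample_bynode = 1, tree_method = "auto", gpu_id = 0, predictor = "auto"
)

# Override only the entries of interest (illustrative values).
params <- modifyList(params, list(max_depth = 4, eta = 0.1))

# Run only if mixgb is installed.
if (requireNamespace("mixgb", quietly = TRUE)) {
  library(mixgb)
  imp <- mixgb(data = nhanes3, m = 2, xgb.params = params, nrounds = 50)
}
```

Passing a complete list avoids any ambiguity about what happens to entries that are left out of xgb.params.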

nrounds

The maximum number of boosting iterations for XGBoost. Default: 100

early_stopping_rounds

An integer value k. XGBoost training will stop if the validation performance hasn't improved for k rounds. Default: 10.

print_every_n

Print XGBoost evaluation information at every nth iteration if xgboost_verbose > 0. Default: 10L

xgboost_verbose

Verbose setting for XGBoost training: 0 (silent), 1 (print information) and 2 (print additional information). Default: 0

...

Extra arguments to pass to XGBoost

Value

If save.models = FALSE, returns a list of m imputed datasets. If save.models = TRUE, returns an object containing the imputed datasets, the saved models and the parameters.

Examples

# obtain m multiply imputed datasets without saving models
mixgb.data <- mixgb(data = nhanes3, m = 2)

# obtain m multiply imputed datasets and save models for imputing new data later on
mixgb.obj <- mixgb(data = nhanes3, m = 2, save.models = TRUE)