Obtain multiply imputed datasets using XGBoost, with an option to save models for imputing new data later on. Users can choose different settings regarding bootstrapping and predictive mean matching as well as XGBoost hyperparameters.
Usage
mixgb(
  data,
  m = 5,
  maxit = 1,
  ordinalAsInteger = FALSE,
  bootstrap = FALSE,
  pmm.type = "auto",
  pmm.k = 5,
  pmm.link = "prob",
  initial.num = "normal",
  initial.int = "mode",
  initial.fac = "mode",
  save.models = FALSE,
  save.vars = NULL,
  verbose = FALSE,
  xgb.params = list(max_depth = 3, gamma = 0, eta = 0.3, min_child_weight = 1,
    subsample = 0.7, colsample_bytree = 1, colsample_bylevel = 1, colsample_bynode = 1,
    tree_method = "auto", gpu_id = 0, predictor = "auto"),
  nrounds = 100,
  early_stopping_rounds = 10,
  print_every_n = 10L,
  xgboost_verbose = 0,
  ...
)
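A minimal sketch of a typical call, assuming the mixgb package is installed. The `airquality` data frame (shipped with base R, with missing values in `Ozone` and `Solar.R`) is used purely for illustration, and the particular argument values are arbitrary choices rather than recommendations.

```r
library(mixgb)

# airquality ships with base R; Ozone and Solar.R contain NAs
dat <- airquality

# Default settings: five imputed datasets, one iteration each
imp <- mixgb(data = dat, m = 5)

# A few settings changed from their defaults (illustrative values only)
imp_custom <- mixgb(
  data = dat,
  m = 3,
  pmm.type = "auto",  # PMM type 2 for numeric/integer variables
  pmm.k = 5,          # five donors per missing value
  nrounds = 50        # at most 50 boosting iterations
)

# With save.models = FALSE the result is expected to be a list of
# m imputed datasets; inspect the first one
head(imp[[1]])
```

The later arguments (`xgb.params`, `early_stopping_rounds`, and so on) are left at their defaults here; they can be passed in the same way.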
Arguments
- data
A data.frame or data.table with missing values
- m
The number of imputed datasets. Default: 5
- maxit
The number of imputation iterations. Default: 1
- ordinalAsInteger
Whether to convert ordinal factors to integers. By default, ordinalAsInteger = FALSE. Setting ordinalAsInteger = TRUE may speed up the imputation process for large datasets.
- bootstrap
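Whether to use bootstrapping for multiple imputation. By default, bootstrap = FALSE. Setting bootstrap = TRUE can improve imputation variability when the sampling-related XGBoost hyperparameters (subsample, colsample_bytree, colsample_bylevel, colsample_bynode) are all set to 1. Note that 1 is the XGBoost default for these hyperparameters, whereas the default xgb.params shown under Usage set subsample = 0.7.
- pmm.type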
The type of predictive mean matching (PMM). Possible values:
NULL: imputations without PMM;
0: imputations with PMM type 0;
1: imputations with PMM type 1;
2: imputations with PMM type 2;
"auto" (default): imputations with PMM type 2 for numeric/integer variables; imputations without PMM for categorical variables.
- pmm.k
The number of donors for predictive mean matching. Default: 5
- pmm.link
The link used for predictive mean matching of binary variables:
"prob" (default): use probabilities;
"logit": use logit values.
- initial.num
Initial imputation method for numeric type data: "normal" (default); "mean"; "median"; "mode"; "sample".
- initial.int
Initial imputation method for integer type data: "mode" (default); "sample".
- initial.fac
Initial imputation method for factor type data: "mode" (default); "sample".
- save.models
Whether to save models for imputing new data later on. Default: FALSE
- save.vars
Response models for variables specified in save.vars will be saved for imputing new data. Can be a vector of names or indices. By default, save.vars = NULL, and response models for variables with missing values will be saved. To save all models, specify save.vars = colnames(data).
- verbose
Verbose setting for mixgb. If TRUE, will print out the progress of imputation. Default: FALSE.
- xgb.params
A list of XGBoost parameters. For more details, please check XGBoost documentation on parameters.
- nrounds
The maximum number of boosting iterations for XGBoost. Default: 100
- early_stopping_rounds
An integer value k. XGBoost training will stop if the validation performance has not improved for k rounds. Default: 10.
- print_every_n
Print XGBoost evaluation information at every nth iteration if xgboost_verbose > 0. Default: 10.
- xgboost_verbose
Verbose setting for XGBoost training: 0 (silent), 1 (print information) and 2 (print additional information). Default: 0
- ...
Extra arguments to pass to XGBoost
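To make the roles of pmm.type and pmm.k more concrete, the following base-R sketch illustrates the general idea of predictive mean matching with k donors. It is not mixgb's internal code: `lm()` stands in for XGBoost, the helper `pmm_impute()` and the simulated data are hypothetical, and the differences between PMM types 0, 1 and 2 are not modelled.

```r
set.seed(1)

# Simulated data (illustrative only): y depends on x, 20 values of y are missing
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)
dat$y[sample(n, 20)] <- NA

# Hypothetical helper: generic PMM with k donors
pmm_impute <- function(dat, k = 5) {
  obs <- !is.na(dat$y)
  # Fit a prediction model on the observed cases only
  fit <- lm(y ~ x, data = dat[obs, ])
  # Predict for every row, observed and missing alike
  pred <- unname(predict(fit, newdata = dat))
  y_obs <- dat$y[obs]
  pred_obs <- pred[obs]
  imputed <- dat$y
  for (i in which(!obs)) {
    # The k observed cases whose predictions are closest to this one
    donors <- order(abs(pred_obs - pred[i]))[seq_len(k)]
    # Draw one donor at random and borrow its observed value
    pick <- donors[sample.int(length(donors), 1)]
    imputed[i] <- y_obs[pick]
  }
  imputed
}

y_imp <- pmm_impute(dat, k = 5)
```

Because every imputed value is copied from an observed donor, the imputations always stay within the range of the observed data, which is the main motivation for using PMM with numeric and integer variables.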