Data Preprocessing
mlr offers several options for data preprocessing.
Some of the following simple methods were already mentioned in section Learning tasks:
- capLargeValues: Convert large/infinite numeric values in a data.frame or Task.
- createDummyFeatures: Generate dummy variables for factor features in a data.frame or Task.
- dropFeatures: Remove some features from a Task.
- joinClassLevels: Only for classification: Merge existing classes in a Task to new, larger classes.
- mergeSmallFactorLevels: Merge infrequent levels of factor features in a Task.
- normalizeFeatures: Normalize features in a Task by different methods, e.g., standardization or scaling to a certain range.
- removeConstantFeatures: Remove constant features from a Task.
- subsetTask: Remove observations and/or features from a Task.
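As a quick illustration, a couple of these helpers can be chained on a Task; the toy data.frame below is made up purely for demonstration:

```r
library(mlr)

## A small regression task with one constant feature (b)
df = data.frame(a = c(1, 2, 3, 4), b = c(7, 7, 7, 7), y = c(2, 4, 6, 8))
task = makeRegrTask(data = df, target = "y")

## Drop the constant feature b, then standardize the remaining features
task = removeConstantFeatures(task)
task = normalizeFeatures(task, method = "standardize")

getTaskFeatureNames(task)  # "a"
head(getTaskData(task))
```

Each helper returns a modified Task, so the calls compose naturally.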
Moreover, distinct sections of this tutorial are devoted to further preprocessing topics.
Additionally, mlr allows you to fuse a Learner with any preprocessing method of your choice, for example any kind of data transformation, normalization, dimensionality reduction, or outlier removal.
Fusing a learner with data preprocessing
A Learner can be coupled with a preprocessing method by function makePreprocWrapper.
As described in the section about wrapped learners, wrappers are implemented using a train and a predict method. In the case of preprocessing wrappers these methods specify how to transform the data before training and before prediction and are completely defined by the user.
The specified preprocessing steps then "belong" to the wrapped Learner. In contrast to the preprocessing options listed above, such as normalizeFeatures,
- the Task itself remains unchanged,
- the preprocessing is not done globally, i.e., for the whole data set, but separately for every pair of training/test data sets, e.g., during resampling,
- any parameters controlling the preprocessing, e.g., the percentage of outliers to be removed, can be tuned together with the base learner's parameters.
Let's see how to create a preprocessing wrapper using a simple example: Some learning methods, e.g., k-nearest neighbors, support vector machines, or neural networks, usually require scaled features. Many, but not all, of them have a built-in scaling option where the training data set is scaled before model fitting and the test data set is scaled accordingly, that is, by using the scaling parameters from the training stage, before predictions are made. In the following we show how to add a scaling option to a Learner by coupling it with function scale.
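The mechanics of "scale the test set with the training parameters" can be seen with base R's scale alone (a minimal sketch with made-up numbers):

```r
## Scale the training data; scale() stores the column means and standard
## deviations it used as attributes of the result.
train = matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
train.sc = scale(train, center = TRUE, scale = TRUE)
ctr = attr(train.sc, "scaled:center")  # column means: 2.5, 25
scl = attr(train.sc, "scaled:scale")   # column standard deviations

## Scale new (test) data with the training parameters, not with its own
test = matrix(c(2, 3, 15, 25), ncol = 2)
test.sc = scale(test, center = ctr, scale = scl)
```

This is exactly the pattern the preprocessing wrapper implements: the parameters extracted during training are stored and reused at prediction time.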
Specifying the train method
The train function has to be a function with the following arguments:
- data is a data.frame with values of all features and the target variable.
- target is a string and denotes the name of the target variable in data.
- args is a list of further arguments and parameters that influence the preprocessing.
It must return a list with elements $data and $control, where $data is the preprocessed data set and $control stores all information required to preprocess the data before prediction.
The train function for the scaling example is given below. It calls scale on the numerical features and returns the scaled training data and the corresponding scaling parameters. args contains the center and scale arguments of function scale, and slot $control stores the scaling parameters.
tr.fun = function(data, target, args = list(center, scale)) {
  cns = colnames(data)
  ## Names of the numerical features, excluding the target
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  ## Scale the numerical features
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  ## Store the scaling parameters in control for use at prediction time
  ctrl = args
  if (is.logical(ctrl$center) && ctrl$center)
    ctrl$center = attr(x, "scaled:center")
  if (is.logical(ctrl$scale) && ctrl$scale)
    ctrl$scale = attr(x, "scaled:scale")
  ## cbind the scaled numerical features with the remaining columns
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = ctrl))
}
Specifying the predict method
The predict function has the following arguments:
- data is a data.frame with feature values. (Naturally, it does not contain values of the target variable.)
- target is a string indicating the name of the target variable.
- args are the args that were passed to the train function.
- control is the object returned by the train function.
It returns the preprocessed data.
In our running example the predict function scales the numerical features using the parameters from the training stage stored in control.
pr.fun = function(data, target, args, control) {
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  ## Scale the numerical features with the parameters from the training stage
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  ## cbind the scaled numerical features with the remaining columns
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}
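To see the two functions interact outside of mlr, one can call them directly on a made-up data.frame (the definitions from above are repeated so this sketch is self-contained): the control object returned by the train step carries the training means and standard deviations into the predict step.

```r
tr.fun = function(data, target, args = list(center, scale)) {
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  ctrl = args
  if (is.logical(ctrl$center) && ctrl$center)
    ctrl$center = attr(x, "scaled:center")
  if (is.logical(ctrl$scale) && ctrl$scale)
    ctrl$scale = attr(x, "scaled:scale")
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = ctrl))
}

pr.fun = function(data, target, args, control) {
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}

## Made-up training data: target y is excluded from scaling
train = data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))
res = tr.fun(train, target = "y", args = list(center = TRUE, scale = TRUE))
res$control$center  # training mean of x: 2.5

## New data is scaled with the training parameters stored in control
test = data.frame(x = c(5, 6))
pr.fun(test, target = "y", args = list(center = TRUE, scale = TRUE),
  control = res$control)
```

When wrapped into a Learner, mlr performs exactly this hand-off for every training/prediction pass.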
Creating the preprocessing wrapper
Below we create a preprocessing wrapper with a regression neural network (which itself does not have a scaling option) as base learner.
The train and predict functions defined above are passed to makePreprocWrapper via the train and predict arguments. par.vals is a list of parameter values that is relayed to the args argument of the train function.
lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
lrn = makePreprocWrapper(lrn, train = tr.fun, predict = pr.fun,
  par.vals = list(center = TRUE, scale = TRUE))
lrn
#> Learner regr.nnet.preproc from package nnet
#> Type: regr
#> Name: ; Short name:
#> Class: PreprocWrapper
#> Properties: numerics,factors,weights
#> Predict-Type: response
#> Hyperparameters: size=3,trace=FALSE,decay=0.01
Let's compare the cross-validated mean squared error (mse) on the Boston Housing data set with and without scaling.
rdesc = makeResampleDesc("CV", iters = 10)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r$aggr
#> mse.test.mean
#> 20.98447
lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r$aggr
#> mse.test.mean
#> 54.37792