Integrating another learner
To integrate a new learner into mlr, interface code to the underlying R function must be written. Three functions are required for each learner. First, you must define the learner itself with a name, description, capabilities, parameters, and a few other things. Second, you need to provide a function that calls the learner function and builds the model given data. Finally, a prediction function that returns predicted values given new data is needed.
All learners should inherit from rlearner.classif, rlearner.regr, rlearner.surv, rlearner.costsens, or rlearner.cluster. While it is also possible to define a new type of learner that has special properties and does not fit into one of the existing schemes, this is much more advanced and not covered here.
Classification
We show how the Linear Discriminant Analysis from package MASS has been integrated into the classification learner classif.lda in mlr as an example.
Definition of the learner
The minimal information required to define a learner is the mlr name of the learner, its package, the parameter set, and the set of properties of your learner. In addition, you may provide a human-readable name, a short name and a note with information relevant to users of the learner.
First, name your learner. The naming conventions in mlr are classif.<R_method_name> for classification, regr.<R_method_name> for regression, surv.<R_method_name> for survival analysis, costsens.<R_method_name> for cost-sensitive learning, and cluster.<R_method_name> for clustering. So in this example, the name starts with classif. and we choose classif.lda.
Second, we need to define the parameters of the learner. These are any options that can be set when running it to change how it learns, how input is interpreted, how and what output is generated, and so on. mlr provides a number of functions to define parameters; a complete list can be found in the documentation of LearnerParam in the ParamHelpers package.
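Besides the discrete and numeric variants used below, ParamHelpers offers constructors for other parameter types. A small illustrative sketch (the ids here are made up, not parameters of any particular learner):
ps = makeParamSet(
  makeLogicalLearnerParam(id = "scale", default = TRUE),
  makeIntegerLearnerParam(id = "maxit", default = 100L, lower = 1L),
  makeNumericVectorLearnerParam(id = "class.weights", lower = 0),
  makeUntypedLearnerParam(id = "init")
)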
In our example, we have discrete and numeric parameters, so we use makeDiscreteLearnerParam and makeNumericLearnerParam to provide a complete description of the parameters. We include all possible values for discrete parameters and lower and upper bounds for numeric parameters. Strictly speaking, it is not necessary to provide bounds for all parameters, and if this information is not available they can be estimated. However, providing accurate and specific information here makes it possible to tune the learner much better (see the section on tuning).
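Such bounds pay off directly during tuning. A minimal sketch, tuning the tol parameter of classif.lda over a bounded range (iris.task and cv3 ship with mlr):
ps = makeParamSet(
  makeNumericParam("tol", lower = 1e-6, upper = 1e-1)
)
ctrl = makeTuneControlRandom(maxit = 10L)
res = tuneParams("classif.lda", task = iris.task, resampling = cv3,
  par.set = ps, control = ctrl)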
Next, we add information on the properties of the learner (see also the section on learners). Which types of features are supported (numerics, factors)? Are case weights supported? Can the method deal with missing values in the features and deal with NAs in a meaningful way (not na.omit)? Are one-class, two-class, multi-class problems supported? Can the learner predict posterior probabilities?
Below is the complete code for the definition of the LDA learner. It has one discrete parameter, method, and two continuous ones, nu and tol. It supports classification problems with two or more classes and can deal with numeric and factor explanatory variables. It can predict posterior probabilities.
makeRLearner.classif.lda = function() {
  makeRLearnerClassif(
    cl = "classif.lda",
    package = "MASS",
    par.set = makeParamSet(
      makeDiscreteLearnerParam(id = "method", default = "moment", values = c("moment", "mle", "mve", "t")),
      makeNumericLearnerParam(id = "nu", lower = 2, requires = expression(method == "t")),
      makeNumericLearnerParam(id = "tol", default = 1.0e-4, lower = 0)
    ),
    properties = c("twoclass", "multiclass", "numerics", "factors", "prob")
  )
}
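At this point the learner can already be constructed and its parameter set inspected through the standard mlr interface:
lrn = makeLearner("classif.lda")
getParamSet(lrn)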
Creating the training function of the learner
Once the learner has been defined, we need to tell mlr how to call it to train a model. The name of the function has to start with trainLearner., followed by the mlr name of the learner as defined above (classif.lda here). The prototype of the function looks as follows.
function(.learner, .task, .subset, .weights = NULL, ...) { }
This function must fit a model on the data of the task .task with regard to the subset defined in the integer vector .subset and the parameters passed in the ... arguments. Usually, the data should be extracted from the task using getTaskData. This will take care of any subsetting as well. It must return the fitted model. mlr assumes no special data type for the return value -- it will be passed to the predict function we are going to define below, so any special code the learner may need can be encapsulated there.
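For instance, a learner that needs extra information at prediction time can simply bundle it with the fitted model. A purely hypothetical sketch (both the learner name and myFit are made up for illustration):
trainLearner.classif.myLearner = function(.learner, .task, .subset, .weights = NULL, ...) {
  d = getTaskData(.task, .subset)
  # myFit() is a stand-in for the package's real fitting function;
  # predictLearner receives whatever structure we return here, unchanged
  list(fit = myFit(d, ...), factor.levels = lapply(Filter(is.factor, d), levels))
}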
For our example, the definition of the function looks like this. In addition to the data of the task, we also need the formula that describes what to predict. We use the function getTaskFormula to extract this from the task.
trainLearner.classif.lda = function(.learner, .task, .subset, .weights = NULL, ...) {
  f = getTaskFormula(.task)
  MASS::lda(f, data = getTaskData(.task, .subset), ...)
}
Creating the prediction method
Finally, the prediction function needs to be defined. The name of this function starts with predictLearner., followed again by the mlr name of the learner. The prototype of the function is as follows.
function(.learner, .model, .newdata, ...) { }
It must predict for the new observations in the data.frame .newdata with the wrapped model .model, which is returned from the training function. The actual model the learner built is stored in the $learner.model member and can be accessed simply through .model$learner.model.
For classification, you have to return a factor of predicted classes if .learner$predict.type is "response", or a matrix of predicted probabilities if .learner$predict.type is "prob" and this type of prediction is supported by the learner. In the latter case, the matrix must have the same number of columns as there are classes in the task and the columns have to be named by the class names.
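As an illustration, suppose a (hypothetical) binary learner returned only a vector of probabilities for the positive class. A sketch of the required conversion, assuming the wrapped model's $task.desc slot, which stores the class levels and the positive class:
predictLearner.classif.myLearner = function(.learner, .model, .newdata, ...) {
  p = predict(.model$learner.model, newdata = .newdata, ...)  # P(positive class)
  levs = .model$task.desc$class.levels
  pos = .model$task.desc$positive
  neg = setdiff(levs, pos)
  if (.learner$predict.type == "prob") {
    # one column per class, named by the class labels
    return(matrix(c(1 - p, p), ncol = 2L, dimnames = list(NULL, c(neg, pos))))
  }
  factor(ifelse(p > 0.5, pos, neg), levels = levs)
}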
The definition for LDA looks like this. It is pretty much just a straight pass-through of the arguments to the predict function and some extraction of prediction data depending on the type of prediction requested.
predictLearner.classif.lda = function(.learner, .model, .newdata, ...) {
  p = predict(.model$learner.model, newdata = .newdata, ...)
  if (.learner$predict.type == "response")
    return(p$class)
  else
    return(p$posterior)
}
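With all three functions defined, the learner works through the usual mlr interface; for example, on the iris.task that ships with mlr:
lrn = makeLearner("classif.lda", predict.type = "prob")
mod = train(lrn, iris.task)
pred = predict(mod, task = iris.task)
head(as.data.frame(pred))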
Regression
The main difference for regression is that the type of prediction is different (numeric values instead of labels or probabilities) and that not all of the properties are relevant. In particular, whether one-, two-, or multi-class problems and posterior probabilities are supported is not applicable.
Apart from this, everything explained above applies. Below is the definition for the earth learner from the earth package.
makeRLearner.regr.earth = function() {
  makeRLearnerRegr(
    cl = "regr.earth",
    package = "earth",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "degree", default = 1L, lower = 1L),
      makeNumericLearnerParam(id = "penalty"),
      makeIntegerLearnerParam(id = "nprune")
    ),
    properties = c("numerics", "factors"),
    name = "Multivariate Adaptive Regression Splines",
    short.name = "earth",
    note = ""
  )
}
trainLearner.regr.earth = function(.learner, .task, .subset, .weights = NULL, ...) {
  f = getTaskFormula(.task)
  earth::earth(f, data = getTaskData(.task, .subset), ...)
}
predictLearner.regr.earth = function(.learner, .model, .newdata, ...) {
  predict(.model$learner.model, newdata = .newdata)[, 1L]
}
Again, most of the data is passed straight through to/from the train/predict functions of the learner.
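A quick sanity check on the Boston Housing task bh.task that ships with mlr (the earth package must be installed):
lrn = makeLearner("regr.earth", degree = 2L)
mod = train(lrn, bh.task)
head(as.data.frame(predict(mod, task = bh.task)))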
Survival Analysis
For survival analysis, you have to return so-called linear predictors in order to compute the default measure for this task type, the cindex (for .learner$predict.type == "response"). For .learner$predict.type == "prob", there is no substantially meaningful measure (yet). You may either ignore this case or return something like predicted survival curves (cf. example below).
There are three properties that are specific to survival learners: "rcens", "lcens" and "icens", defining the type(s) of censoring a learner can handle -- right, left and/or interval censored.
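As with any learner, the declared properties can be inspected later from the constructed learner object; e.g., for the surv.coxph learner that ships with mlr:
lrn = makeLearner("surv.coxph")
lrn$properties
# "missings" "numerics" "factors" "weights" "prob" "rcens"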
Let's have a look at how the Cox Proportional Hazard Model from package survival has been integrated into the survival learner surv.coxph in mlr as an example:
makeRLearner.surv.coxph = function() {
  makeRLearnerSurv(
    cl = "surv.coxph",
    package = "survival",
    par.set = makeParamSet(
      makeDiscreteLearnerParam(id = "ties", default = "efron", values = c("efron", "breslow", "exact")),
      makeLogicalLearnerParam(id = "singular.ok", default = TRUE),
      makeNumericLearnerParam(id = "eps", default = 1e-09, lower = 0),
      makeNumericLearnerParam(id = "toler.chol", default = .Machine$double.eps^0.75, lower = 0),
      makeIntegerLearnerParam(id = "iter.max", default = 20L, lower = 1L),
      makeNumericLearnerParam(id = "toler.inf", default = sqrt(.Machine$double.eps^0.75), lower = 0),
      makeIntegerLearnerParam(id = "outer.max", default = 10L, lower = 1L)
    ),
    properties = c("missings", "numerics", "factors", "weights", "prob", "rcens"),
    name = "Cox Proportional Hazard Model",
    short.name = "coxph",
    note = ""
  )
}
trainLearner.surv.coxph = function(.learner, .task, .subset, .weights = NULL, ...) {
  f = as.formula(getTaskFormulaAsString(.task))
  data = getTaskData(.task, subset = .subset)
  if (is.null(.weights)) {
    mod = survival::coxph(formula = f, data = data, ...)
  } else {
    mod = survival::coxph(formula = f, data = data, weights = .weights, ...)
  }
  if (.learner$predict.type == "prob")
    mod = attachTrainingInfo(mod, list(surv.range = range(getTaskTargets(.task)[, 1L])))
  mod
}
predictLearner.surv.coxph = function(.learner, .model, .newdata, ...) {
  if (.learner$predict.type == "response") {
    predict(.model$learner.model, newdata = .newdata, type = "lp", ...)
  } else if (.learner$predict.type == "prob") {
    surv.range = getTrainingInfo(.model$learner.model)$surv.range
    times = seq(from = surv.range[1L], to = surv.range[2L], length.out = 1000)
    t(summary(survival::survfit(.model$learner.model, newdata = .newdata, se.fit = FALSE, conf.int = FALSE), times = times)$surv)
  } else {
    stop("Unknown predict type")
  }
}
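A short usage sketch, assuming the lung data from the survival package (reduced to complete cases, with the 1/2 status coding turned into a logical event indicator):
data(lung, package = "survival")
lung = na.omit(lung)
lung$status = lung$status == 2  # TRUE = death observed, FALSE = right censored
task = makeSurvTask(data = lung, target = c("time", "status"))
mod = train(makeLearner("surv.coxph"), task)
head(as.data.frame(predict(mod, task = task)))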
Clustering
For clustering, you have to return a numeric vector with the IDs of the clusters that the respective datum has been assigned to. The numbering should start at 1.
Below is the definition for the FarthestFirst learner from the RWeka package. Weka starts the IDs of the clusters at 0, so we add 1 to the predicted clusters. RWeka has a different way of setting learner parameters; we use the special Weka_control function to do this.
makeRLearner.cluster.FarthestFirst = function() {
  makeRLearnerCluster(
    cl = "cluster.FarthestFirst",
    package = "RWeka",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "N", default = 2L, lower = 1L),
      makeIntegerLearnerParam(id = "S", default = 1L, lower = 1L)
    ),
    properties = c("numerics"),
    name = "FarthestFirst Clustering Algorithm",
    short.name = "farthestfirst"
  )
}
trainLearner.cluster.FarthestFirst = function(.learner, .task, .subset, .weights = NULL, ...) {
  ctrl = RWeka::Weka_control(...)
  RWeka::FarthestFirst(getTaskData(.task, .subset), control = ctrl)
}
predictLearner.cluster.FarthestFirst = function(.learner, .model, .newdata, ...) {
  predict(.model$learner.model, .newdata, ...) + 1
}