Tuning Hyperparameters

Many machine learning algorithms have hyperparameters that need to be set. If you want to select them yourself, you can specify them as explained in the section about Learners -- simply pass them to makeLearner. Often suitable parameter values are not obvious, and it is preferable to tune the hyperparameters, that is, to automatically identify the values that lead to the best performance.
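
For example, if you already know good values, you could fix them directly when constructing the learner (a minimal sketch, assuming mlr is loaded and the kernlab package is installed):

## set the SVM hyperparameters C and sigma manually instead of tuning them
lrn = makeLearner("classif.ksvm", C = 1, sigma = 0.2)
getHyperPars(lrn)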

Basics

For tuning you have to specify

- the search space, i.e., the hyperparameters and the ranges of values to consider,
- the optimization algorithm (also called the tuning method), and
- an evaluation method, i.e., a resampling strategy together with a performance measure.

The last point is already covered in this tutorial in the parts about the evaluation of learning methods and resampling.

Below we show how to specify the search space and optimization algorithm, how to do the tuning and how to access the tuning result, using the example of a grid search.

Throughout this section we consider classification examples. For the other types of learning problems tuning works analogously.

Grid search with manual discretization

A grid search is one of the standard -- albeit slow -- ways to choose an appropriate set of parameters from a given range of values.

We use the iris classification task for illustration and tune the hyperparameters of an SVM (function ksvm from the kernlab package) with a radial basis kernel.

First, we create a ParamSet object, which describes the parameter space we wish to search. This is done via function makeParamSet. We wish to tune the cost parameter C and the RBF kernel parameter sigma of the ksvm function. Since we will use a grid search strategy, we add discrete parameters to the parameter set. The specified values have to be vectors of feasible settings, and the complete grid is simply their cross-product. Every entry in the parameter set has to be named according to the corresponding parameter of the underlying R function.

Please note that whenever parameters in the underlying R functions should be passed in a list structure, mlr tries to give you direct access to each parameter and get rid of the list structure. This is the case with the kpar argument of ksvm which is a list of kernel parameters like sigma.
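
As a small illustration of this flattening (just a sketch, not needed for the tuning below): a direct call to kernlab wraps sigma in the kpar list, whereas the mlr learner exposes sigma as an ordinary hyperparameter.

library(kernlab)
## calling kernlab directly: kernel parameters go into the kpar list
fit = ksvm(Species ~ ., data = iris, kernel = "rbfdot", kpar = list(sigma = 1), C = 1)

## the corresponding mlr learner: sigma is set directly, no list needed
lrn = makeLearner("classif.ksvm", C = 1, sigma = 1)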

ps = makeParamSet(
  makeDiscreteParam("C", values = 2^(-2:2)),
  makeDiscreteParam("sigma", values = 2^(-2:2))
)

In addition to the parameter set, we need an instance of a TuneControl object. These objects describe the optimization strategy to be used and its settings. Here we choose a grid search:

ctrl = makeTuneControlGrid()

We will use 3-fold cross-validation to assess the quality of a specific parameter setting. For this we need to create a resampling description just like in the resampling part of the tutorial.

rdesc = makeResampleDesc("CV", iters = 3L)

Finally, by combining all the previous pieces, we can tune the SVM parameters by calling tuneParams.

res = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc, par.set = ps,
  control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#>           Type len Def         Constr Req Tunable Trafo
#> C     discrete   -   - 0.25,0.5,1,2,4   -    TRUE     -
#> sigma discrete   -   - 0.25,0.5,1,2,4   -    TRUE     -
#> With control class: TuneControlGrid
#> Imputation value: 1
#> [Tune-x] 1: C=0.25; sigma=0.25
#> [Tune-y] 1: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 2: C=0.5; sigma=0.25
#> [Tune-y] 2: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 3: C=1; sigma=0.25
#> [Tune-y] 3: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 4: C=2; sigma=0.25
#> [Tune-y] 4: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 5: C=4; sigma=0.25
#> [Tune-y] 5: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 6: C=0.25; sigma=0.5
#> [Tune-y] 6: mmce.test.mean=0.06; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 7: C=0.5; sigma=0.5
#> [Tune-y] 7: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 8: C=1; sigma=0.5
#> [Tune-y] 8: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 9: C=2; sigma=0.5
#> [Tune-y] 9: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 10: C=4; sigma=0.5
#> [Tune-y] 10: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 11: C=0.25; sigma=1
#> [Tune-y] 11: mmce.test.mean=0.0533; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 12: C=0.5; sigma=1
#> [Tune-y] 12: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 13: C=1; sigma=1
#> [Tune-y] 13: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 14: C=2; sigma=1
#> [Tune-y] 14: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 15: C=4; sigma=1
#> [Tune-y] 15: mmce.test.mean=0.0533; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 16: C=0.25; sigma=2
#> [Tune-y] 16: mmce.test.mean=0.0533; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 17: C=0.5; sigma=2
#> [Tune-y] 17: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 18: C=1; sigma=2
#> [Tune-y] 18: mmce.test.mean=0.0333; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 19: C=2; sigma=2
#> [Tune-y] 19: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 20: C=4; sigma=2
#> [Tune-y] 20: mmce.test.mean=0.0467; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 21: C=0.25; sigma=4
#> [Tune-y] 21: mmce.test.mean=0.113; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 22: C=0.5; sigma=4
#> [Tune-y] 22: mmce.test.mean=0.0667; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 23: C=1; sigma=4
#> [Tune-y] 23: mmce.test.mean=0.0533; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 24: C=2; sigma=4
#> [Tune-y] 24: mmce.test.mean=0.06; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 25: C=4; sigma=4
#> [Tune-y] 25: mmce.test.mean=0.0667; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune] Result: C=1; sigma=2 : mmce.test.mean=0.0333
res
#> Tune result:
#> Op. pars: C=1; sigma=2
#> mmce.test.mean=0.0333

tuneParams simply performs the cross-validation for every element of the cross-product and selects the parameter setting with the best mean performance. As no performance measure was specified, by default the error rate (mmce) is used.
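
If you want to make this default explicit, you can pass the measure yourself; the following sketch is equivalent to the call above (mmce is mlr's mean misclassification error):

res = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc, par.set = ps,
  control = ctrl, measures = mmce, show.info = FALSE)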

Note that each measure "knows" if it is minimized or maximized during tuning.

## error rate
mmce$minimize
#> [1] TRUE

## accuracy
acc$minimize
#> [1] FALSE

Of course, you can pass other measures and also a list of measures to tuneParams. In the latter case the first measure is optimized during tuning, the others are simply evaluated. If you are interested in optimizing several measures simultaneously have a look at the paragraph about multi-criteria tuning below.

In the example below we calculate the accuracy (acc) instead of the error rate. We use function setAggregation, as described in the section on resampling, to additionally obtain the standard deviation of the accuracy.

res = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc, par.set = ps,
  control = ctrl, measures = list(acc, setAggregation(acc, test.sd)), show.info = FALSE)
res
#> Tune result:
#> Op. pars: C=0.25; sigma=0.25
#> acc.test.mean=0.953,acc.test.sd=0.0306

Accessing the tuning result

The result object TuneResult allows you to access the best found settings $x and their estimated performance $y.

res$x
#> $C
#> [1] 0.25
#> 
#> $sigma
#> [1] 0.25
res$y
#> acc.test.mean   acc.test.sd 
#>     0.9533333     0.0305505

Moreover, we can inspect all points evaluated during the search by accessing the $opt.path (see also the documentation of OptPath).

res$opt.path
#> Optimization path
#>   Dimensions: x = 2/2, y = 2
#>   Length: 25
#>   Add x values transformed: FALSE
#>   Error messages: TRUE. Errors: 0 / 25.
#>   Exec times: TRUE. Range: 0.092 - 0.172. 0 NAs.
opt.grid = as.data.frame(res$opt.path)
head(opt.grid)
#>      C sigma acc.test.mean acc.test.sd dob eol error.message exec.time
#> 1 0.25  0.25     0.9533333  0.03055050   1  NA          <NA>     0.102
#> 2  0.5  0.25     0.9466667  0.02309401   2  NA          <NA>     0.093
#> 3    1  0.25     0.9533333  0.01154701   3  NA          <NA>     0.110
#> 4    2  0.25     0.9533333  0.01154701   4  NA          <NA>     0.101
#> 5    4  0.25     0.9533333  0.01154701   5  NA          <NA>     0.094
#> 6 0.25   0.5     0.9333333  0.01154701   6  NA          <NA>     0.098

A quick visualization of the performance values on the search grid can be accomplished as follows:

library(ggplot2)
g = ggplot(opt.grid, aes(x = C, y = sigma, fill = acc.test.mean, label = round(acc.test.sd, 3)))
g + geom_tile() + geom_text(color = "white")

[Figure: tile plot of the grid search results over C and sigma]

The colors of the tiles display the achieved accuracy; the tile labels show the standard deviation.

Using the optimal parameter values

After tuning you can generate a Learner with optimal hyperparameter settings as follows:

lrn = setHyperPars(makeLearner("classif.ksvm"), par.vals = res$x)
lrn
#> Learner classif.ksvm from package kernlab
#> Type: classif
#> Name: Support Vector Machines; Short name: ksvm
#> Class: classif.ksvm
#> Properties: twoclass,multiclass,numerics,factors,prob,class.weights
#> Predict-Type: response
#> Hyperparameters: fit=FALSE,C=0.25,sigma=0.25

Then you can proceed as usual. Here we refit the learner on the complete iris data set and make predictions for it.

m = train(lrn, iris.task)
predict(m, task = iris.task)
#> Prediction: 150 observations
#> predict.type: response
#> threshold: 
#> time: 0.01
#>   id  truth response
#> 1  1 setosa   setosa
#> 2  2 setosa   setosa
#> 3  3 setosa   setosa
#> 4  4 setosa   setosa
#> 5  5 setosa   setosa
#> 6  6 setosa   setosa

Grid search without manual discretization

We can also specify the true numeric parameter types of C and sigma when creating the parameter set and use the resolution option of makeTuneControlGrid to automatically discretize them.

Note how we also make use of the trafo option when creating the parameter set to easily optimize on a log-scale.

Trafos work like this: All optimizers basically see the parameters on their original scale (from -12 to 12 in this case) and produce values on this scale during the search. Right before they are passed to the learning algorithm, the transformation function is applied.
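
As a minimal illustration (plain R, not part of the tuning call below): with resolution = 3 the grid search places points at -12, 0 and 12 on the untransformed scale, and the trafo maps them to the values that ksvm actually receives.

x = seq(-12, 12, length.out = 3)  ## grid points seen by the optimizer: -12, 0, 12
2^x                               ## values passed on to ksvm: 2^-12, 1 and 2^12 = 4096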

ps = makeParamSet(
  makeNumericParam("C", lower = -12, upper = 12, trafo = function(x) 2^x),
  makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x)
)
ctrl = makeTuneControlGrid(resolution = 3L)
rdesc = makeResampleDesc("CV", iters = 2L)
res = tuneParams("classif.ksvm", iris.task, rdesc, par.set = ps, control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#>          Type len Def    Constr Req Tunable Trafo
#> C     numeric   -   - -12 to 12   -    TRUE     Y
#> sigma numeric   -   - -12 to 12   -    TRUE     Y
#> With control class: TuneControlGrid
#> Imputation value: 1
#> [Tune-x] 1: C=0.000244; sigma=0.000244
#> [Tune-y] 1: mmce.test.mean=0.527; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 2: C=1; sigma=0.000244
#> [Tune-y] 2: mmce.test.mean=0.527; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 3: C=4.1e+03; sigma=0.000244
#> [Tune-y] 3: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 4: C=0.000244; sigma=1
#> [Tune-y] 4: mmce.test.mean=0.527; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 5: C=1; sigma=1
#> [Tune-y] 5: mmce.test.mean=0.04; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 6: C=4.1e+03; sigma=1
#> [Tune-y] 6: mmce.test.mean=0.0667; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 7: C=0.000244; sigma=4.1e+03
#> [Tune-y] 7: mmce.test.mean=0.567; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 8: C=1; sigma=4.1e+03
#> [Tune-y] 8: mmce.test.mean=0.687; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune-x] 9: C=4.1e+03; sigma=4.1e+03
#> [Tune-y] 9: mmce.test.mean=0.687; time: 0.0 min; memory: 144Mb use, 244Mb max
#> [Tune] Result: C=1; sigma=1 : mmce.test.mean=0.04
res
#> Tune result:
#> Op. pars: C=1; sigma=1
#> mmce.test.mean=0.04

Note that res$opt.path contains the parameter values on the original scale.

as.data.frame(res$opt.path)
#>     C sigma mmce.test.mean dob eol error.message exec.time
#> 1 -12   -12     0.52666667   1  NA          <NA>     0.068
#> 2   0   -12     0.52666667   2  NA          <NA>     0.065
#> 3  12   -12     0.04000000   3  NA          <NA>     0.068
#> 4 -12     0     0.52666667   4  NA          <NA>     0.076
#> 5   0     0     0.04000000   5  NA          <NA>     0.067
#> 6  12     0     0.06666667   6  NA          <NA>     0.065
#> 7 -12    12     0.56666667   7  NA          <NA>     0.076
#> 8   0    12     0.68666667   8  NA          <NA>     0.068
#> 9  12    12     0.68666667   9  NA          <NA>     0.084

In order to get the transformed parameter values instead, use function trafoOptPath.

as.data.frame(trafoOptPath(res$opt.path))
#>              C        sigma mmce.test.mean dob eol
#> 1 2.441406e-04 2.441406e-04     0.52666667   1  NA
#> 2 1.000000e+00 2.441406e-04     0.52666667   2  NA
#> 3 4.096000e+03 2.441406e-04     0.04000000   3  NA
#> 4 2.441406e-04 1.000000e+00     0.52666667   4  NA
#> 5 1.000000e+00 1.000000e+00     0.04000000   5  NA
#> 6 4.096000e+03 1.000000e+00     0.06666667   6  NA
#> 7 2.441406e-04 4.096000e+03     0.56666667   7  NA
#> 8 1.000000e+00 4.096000e+03     0.68666667   8  NA
#> 9 4.096000e+03 4.096000e+03     0.68666667   9  NA

Iterated F-Racing for mixed spaces and dependencies

The package supports a larger number of tuning algorithms, which can all be looked up and selected via TuneControl. One of the cooler algorithms is iterated F-racing from the irace package. This not only works for arbitrary parameter types (numeric, integer, discrete, logical), but also for so-called dependent / hierarchical parameters:

ps = makeParamSet(
  makeNumericParam("C", lower = -12, upper = 12, trafo = function(x) 2^x),
  makeDiscreteParam("kernel", values = c("vanilladot", "polydot", "rbfdot")),
  makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x,
    requires = quote(kernel == "rbfdot")),
  makeIntegerParam("degree", lower = 2L, upper = 5L,
    requires = quote(kernel == "polydot"))
)
ctrl = makeTuneControlIrace(maxExperiments = 200L)
rdesc = makeResampleDesc("Holdout")
res = tuneParams("classif.ksvm", iris.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)
print(head(as.data.frame(res$opt.path)))
#>           C     kernel    sigma degree mmce.test.mean dob eol
#> 1  3.148837    polydot       NA      5           0.08   1  NA
#> 2  3.266305 vanilladot       NA     NA           0.02   2  NA
#> 3 -3.808213 vanilladot       NA     NA           0.04   3  NA
#> 4  1.694097     rbfdot 6.580514     NA           0.48   4  NA
#> 5 11.995501    polydot       NA      2           0.08   5  NA
#> 6 -5.731782 vanilladot       NA     NA           0.14   6  NA
#>   error.message exec.time
#> 1          <NA>     0.051
#> 2          <NA>     0.038
#> 3          <NA>     0.041
#> 4          <NA>     0.060
#> 5          <NA>     0.048
#> 6          <NA>     0.037

See how we made the kernel parameters like sigma and degree dependent on the kernel selection parameter? This approach allows you to tune parameters of multiple kernels at once, efficiently concentrating on the ones which work best for your given data set.

Tuning across whole model spaces with ModelMultiplexer

We can now take the previous example one step further. If we use the ModelMultiplexer we can tune over different model classes at once, just as we did with the SVM kernels above.

base.learners = list(
  makeLearner("classif.ksvm"),
  makeLearner("classif.randomForest")
)
lrn = makeModelMultiplexer(base.learners)

Function makeModelMultiplexerParamSet offers a simple way to construct a parameter set for tuning: the parameter names are prefixed automatically and the requires element is set as well, so that all parameters are subordinate to selected.learner.

ps = makeModelMultiplexerParamSet(lrn,
  makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
print(ps)
#>                                Type len Def
#> selected.learner           discrete   -   -
#> classif.ksvm.sigma          numeric   -   -
#> classif.randomForest.ntree  integer   -   -
#>                                                       Constr Req Tunable
#> selected.learner           classif.ksvm,classif.randomForest   -    TRUE
#> classif.ksvm.sigma                                 -12 to 12   Y    TRUE
#> classif.randomForest.ntree                          1 to 500   Y    TRUE
#>                            Trafo
#> selected.learner               -
#> classif.ksvm.sigma             Y
#> classif.randomForest.ntree     -
rdesc = makeResampleDesc("CV", iters = 2L)
ctrl = makeTuneControlIrace(maxExperiments = 200L)
res = tuneParams(lrn, iris.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)
print(head(as.data.frame(res$opt.path)))
#>       selected.learner classif.ksvm.sigma classif.randomForest.ntree
#> 1         classif.ksvm          -3.673815                         NA
#> 2         classif.ksvm           6.361006                         NA
#> 3 classif.randomForest                 NA                        487
#> 4         classif.ksvm           3.165340                         NA
#> 5 classif.randomForest                 NA                        125
#> 6 classif.randomForest                 NA                        383
#>   mmce.test.mean dob eol error.message exec.time
#> 1     0.04666667   1  NA          <NA>     0.073
#> 2     0.75333333   2  NA          <NA>     0.112
#> 3     0.03333333   3  NA          <NA>     0.156
#> 4     0.24000000   4  NA          <NA>     0.074
#> 5     0.04000000   5  NA          <NA>     0.067
#> 6     0.04000000   6  NA          <NA>     0.097

Multi-criteria evaluation and optimization

During tuning you might want to optimize multiple, potentially conflicting, performance measures simultaneously.

In the following example we aim to minimize both the false positive and the false negative rate (fpr and fnr). We again tune the hyperparameters of an SVM (function ksvm) with a radial basis kernel and use the sonar classification task for illustration. As search strategy we choose a random search.

For all available multi-criteria tuning algorithms see TuneMultiCritControl.

ps = makeParamSet(
  makeNumericParam("C", lower = -12, upper = 12, trafo = function(x) 2^x),
  makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x)
)
ctrl = makeTuneMultiCritControlRandom(maxit = 30L)
rdesc = makeResampleDesc("Holdout")
res = tuneParamsMultiCrit("classif.ksvm", task = sonar.task, resampling = rdesc, par.set = ps,
  measures = list(fpr, fnr), control = ctrl, show.info = FALSE)
res
#> Tune multicrit result:
#> Points on front: 5
head(as.data.frame(trafoOptPath(res$opt.path)))
#>              C        sigma fpr.test.mean fnr.test.mean dob eol
#> 1 1.052637e-01  0.003374481     0.0000000    1.00000000   1  NA
#> 2 1.612578e+02 14.303163917     0.0000000    1.00000000   2  NA
#> 3 3.697931e+03  0.026982462     0.1851852    0.06976744   3  NA
#> 4 2.331471e+02 11.791412207     0.0000000    1.00000000   4  NA
#> 5 2.078857e-02  0.010218565     0.0000000    1.00000000   5  NA
#> 6 3.382767e+02  2.187025359     0.0000000    1.00000000   6  NA

The results can be visualized with function plotTuneMultiCritResult. The plot shows the false positive and false negative rates for all parameter settings evaluated during tuning. Points on the Pareto front are slightly increased.

plotTuneMultiCritResult(res)

[Figure: output of plotTuneMultiCritResult]

Further comments