Benchmark Experiments
In a benchmark experiment, different learning methods are applied to one or several data sets with the aim of comparing and ranking the algorithms with respect to one or more performance measures.
In mlr a benchmark experiment can be conducted by calling function benchmark on a list of Learners and a list of Tasks. benchmark basically executes resample for each combination of Learner and Task. You can specify an individual resampling strategy for each Task and select one or more performance measures to be calculated.
Example: One task, two learners, prediction on a single test set
We start with a small example. Two learners, linear discriminant analysis and classification trees, are applied to one classification problem (sonar.task). As resampling strategy we choose "Holdout". The performance is thus calculated on a single randomly sampled test data set.
In the example below we create a resample description (ResampleDesc), which is automatically instantiated by benchmark. The instantiation is done only once, that is, the same resample instance (ResampleInstance) is used for all learners applied to the Task. It's also possible to directly pass a ResampleInstance.
If you would like to use a fixed test data set instead of a randomly selected one, you can create a suitable ResampleInstance through function makeFixedHoldoutInstance.
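For example, a fixed split for sonar.task could be defined as follows (a minimal sketch; the chosen indices are arbitrary and rely on the 208 observations of the Sonar data):
## Hypothetical fixed holdout split: first 150 observations for training, the rest for testing
rin = makeFixedHoldoutInstance(train.inds = 1:150, test.inds = 151:208, size = 208)
## rin could then be passed to benchmark in place of a resample description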
## Two learners to be compared
lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
## Choose the resampling strategy
rdesc = makeResampleDesc("Holdout")
## Conduct the benchmark experiment
res = benchmark(lrns, sonar.task, rdesc)
#> Task: Sonar-example, Learner: classif.lda
#> [Resample] holdout iter: 1
#> [Resample] Result: mmce.test.mean= 0.3
#> Task: Sonar-example, Learner: classif.rpart
#> [Resample] holdout iter: 1
#> [Resample] Result: mmce.test.mean=0.286
res
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.3000000
#> 2 Sonar-example classif.rpart 0.2857143
In the printed table every row corresponds to one pair of Task and Learner. The entries show the mean misclassification error (mmce), the default performance measure for classification, on the test data set.
The result res is an object of class BenchmarkResult. Basically, this is a list of lists of ResampleResult objects, first ordered by Task and then by Learner. mlr provides several accessor functions, named getBMR<what_to_extract>, that allow you to retrieve information for further analyses, for example the performances or predictions of the learning algorithms under consideration.
Let's have a look at the benchmark result above. getBMRPerformances returns individual performances in resampling runs, while getBMRAggrPerformances gives the aggregated values.
getBMRPerformances(res)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> iter mmce
#> 1 1 0.3
#>
#> $`Sonar-example`$classif.rpart
#> iter mmce
#> 1 1 0.2857143
getBMRAggrPerformances(res)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> mmce.test.mean
#> 0.3
#>
#> $`Sonar-example`$classif.rpart
#> mmce.test.mean
#> 0.2857143
Since we used holdout as resampling strategy, individual and aggregated performances are the same.
Often it is more convenient to work with data.frames. You can easily convert the result structure by setting as.df = TRUE.
getBMRPerformances(res, as.df = TRUE)
#> task.id learner.id iter mmce
#> 1 Sonar-example classif.lda 1 0.3000000
#> 2 Sonar-example classif.rpart 1 0.2857143
getBMRAggrPerformances(res, as.df = TRUE)
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.3000000
#> 2 Sonar-example classif.rpart 0.2857143
Function getBMRPredictions returns the predictions. By default, you get a list of lists of ResamplePrediction objects. In most cases you might prefer the data.frame version.
getBMRPredictions(res)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold:
#> time (mean): 0.01
#> id truth response iter set
#> 180 180 M M 1 test
#> 100 100 M R 1 test
#> 53 53 R M 1 test
#> 89 89 R R 1 test
#> 92 92 R M 1 test
#> 11 11 R R 1 test
#>
#> $`Sonar-example`$classif.rpart
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold:
#> time (mean): 0.01
#> id truth response iter set
#> 180 180 M M 1 test
#> 100 100 M M 1 test
#> 53 53 R R 1 test
#> 89 89 R M 1 test
#> 92 92 R M 1 test
#> 11 11 R R 1 test
head(getBMRPredictions(res, as.df = TRUE))
#> task.id learner.id id truth response iter set
#> 1 Sonar-example classif.lda 180 M M 1 test
#> 2 Sonar-example classif.lda 100 M R 1 test
#> 3 Sonar-example classif.lda 53 R M 1 test
#> 4 Sonar-example classif.lda 89 R R 1 test
#> 5 Sonar-example classif.lda 92 R M 1 test
#> 6 Sonar-example classif.lda 11 R R 1 test
You can also access the results for certain learners or tasks via their IDs. Nearly all getter functions have learner.ids and task.ids arguments.
head(getBMRPredictions(res, learner.ids = "classif.rpart", as.df = TRUE))
#> task.id learner.id id truth response iter set
#> 180 Sonar-example classif.rpart 180 M M 1 test
#> 100 Sonar-example classif.rpart 100 M M 1 test
#> 53 Sonar-example classif.rpart 53 R R 1 test
#> 89 Sonar-example classif.rpart 89 R M 1 test
#> 92 Sonar-example classif.rpart 92 R M 1 test
#> 11 Sonar-example classif.rpart 11 R R 1 test
As you might recall, you can use the id option of makeLearner or function setId to set the ID of a Learner, and the id option of make*Task to set the ID of a Task.
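For illustration (a small sketch, not needed for the results above), a learner ID can be set at creation time or changed afterwards:
## Set the ID when creating the Learner ...
lrn = makeLearner("classif.lda", id = "lda")
## ... or change it later via setId
lrn = setId(lrn, "lda2")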
The IDs of all Learners and Tasks in a benchmark experiment can be retrieved as follows:
getBMRTaskIds(res)
#> [1] "Sonar-example"
getBMRLearnerIds(res)
#> [1] "classif.lda" "classif.rpart"
Example: Two tasks, three learners, bootstrapping
Let's have a look at a larger benchmark experiment with two classification tasks (pid.task and sonar.task) and three learning algorithms. Since the default learner IDs are a little long, we choose shorter names.
For both tasks bootstrapping with 20 iterations is chosen as resampling strategy. This is achieved by passing a single resample description to benchmark, which is then instantiated automatically once for each Task. Thus, the same instance is used for all learners applied to one task.
It's also possible to choose a different resampling strategy for each Task by passing a list of resampling strategies, one per Task; the list can contain both ResampleDescs and ResampleInstances.
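For illustration only (not part of the experiment below), such a list could look as follows; the choice of ten-fold cross-validation for the first Task is an arbitrary assumption:
## Hypothetical per-task resampling: 10-fold CV for pid.task, bootstrapping for sonar.task
rdescs = list(makeResampleDesc("CV", iters = 10), makeResampleDesc("Bootstrap", iters = 20))
## benchmark(lrns, tasks, rdescs) would then pair each strategy with the corresponding Task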
In the example below the accuracy (acc) and the area under the ROC curve (auc) are calculated.
## Three learners to be compared
lrns = list(makeLearner("classif.lda", predict.type = "prob", id = "lda"),
makeLearner("classif.rpart", predict.type = "prob", id = "rpart"),
makeLearner("classif.randomForest", predict.type = "prob", id = "rF"))
## Two classification tasks
tasks = list(pid.task, sonar.task)
## Use bootstrapping for both tasks
rdesc = makeResampleDesc("Bootstrap", iters = 20)
## Conduct the benchmark experiment
res = benchmark(lrns, tasks, rdesc, measures = list(acc, auc), show.info = FALSE)
res
#> task.id learner.id acc.test.mean auc.test.mean
#> 1 PimaIndiansDiabetes-example lda 0.7677489 0.8237077
#> 2 PimaIndiansDiabetes-example rpart 0.7386807 0.7367115
#> 3 PimaIndiansDiabetes-example rF 0.7600432 0.8166053
#> 4 Sonar-example lda 0.7137996 0.7755366
#> 5 Sonar-example rpart 0.6985412 0.7331386
#> 6 Sonar-example rF 0.8156073 0.9184084
The entries in the printed table show the aggregated accuracies and AUC values. On the Pima data, lda and random forest show nearly identical performance. On the Sonar data, the random forest has the highest accuracy and AUC.
Instead of just comparing mean performance values it's generally preferable to have a look at the distribution of performance values obtained in individual resampling runs. The individual performances on the 20 bootstrap iterations for every task and learner are retrieved below.
perf = getBMRPerformances(res, as.df = TRUE)
head(perf)
#> task.id learner.id iter acc auc
#> 1 PimaIndiansDiabetes-example lda 1 0.7065217 0.7644810
#> 2 PimaIndiansDiabetes-example lda 2 0.7789116 0.8158948
#> 3 PimaIndiansDiabetes-example lda 3 0.7977941 0.8517787
#> 4 PimaIndiansDiabetes-example lda 4 0.7781818 0.8162754
#> 5 PimaIndiansDiabetes-example lda 5 0.7403509 0.7966082
#> 6 PimaIndiansDiabetes-example lda 6 0.7681661 0.8356029
As part of a first exploratory analysis you might want to create some plots, for example dotplots, boxplots, densityplots or histograms. Currently, mlr does not provide any plotting functionality for benchmark experiments. But based on the data.frame returned by getBMRPerformances some basic plots are easily done.
Shown below are boxplots for the accuracy, acc, and densityplots for the AUC, auc, generated by function qplot from package ggplot2.
qplot(y = acc, x = task.id, colour = learner.id, data = perf, geom = "boxplot")
qplot(auc, colour = learner.id, facets = . ~ task.id, data = perf, geom = "density")
In order to plot both performance measures in parallel, perf is reshaped to long format. Below we generate grouped boxplots and densityplots for all tasks, learners and measures.
perfm = reshape2::melt(perf, id.vars = c("task.id", "learner.id", "iter"), measure.vars = c("acc", "auc"))
head(perfm)
#> task.id learner.id iter variable value
#> 1 PimaIndiansDiabetes-example lda 1 acc 0.7065217
#> 2 PimaIndiansDiabetes-example lda 2 acc 0.7789116
#> 3 PimaIndiansDiabetes-example lda 3 acc 0.7977941
#> 4 PimaIndiansDiabetes-example lda 4 acc 0.7781818
#> 5 PimaIndiansDiabetes-example lda 5 acc 0.7403509
#> 6 PimaIndiansDiabetes-example lda 6 acc 0.7681661
qplot(variable, value, data = perfm, colour = learner.id, facets = . ~ task.id, geom = "boxplot",
xlab = "measure", ylab = "performance")
qplot(value, data = perfm, colour = learner.id, facets = variable ~ task.id, geom = "density",
xlab = "performance")
It might also be useful to assess if learner performances in single resampling iterations, i.e., on the same bootstrap sample, are related. This might help to gain further insight, for example by having a closer look at bootstrap samples where one learner performs exceptionally well while another one is fairly bad. Moreover, this might be useful for the construction of ensembles of learning algorithms. Below, function ggpairs from package GGally is used to generate a scatterplot matrix of learner accuracies (acc) on the sonar data set.
perf = getBMRPerformances(res, task.id = "Sonar-example", as.df = TRUE)
perfr = reshape(perf, direction = "wide", v.names = c("acc", "auc"), timevar = "learner.id",
idvar = c("task.id", "iter"))
head(perfr)
#> task.id iter acc.lda auc.lda acc.rpart auc.rpart acc.rF
#> 1 Sonar-example 1 0.7468354 0.7928571 0.6202532 0.5928571 0.7341772
#> 2 Sonar-example 2 0.7466667 0.7642450 0.7066667 0.6613248 0.7200000
#> 3 Sonar-example 3 0.7968750 0.8666667 0.7031250 0.7406863 0.9062500
#> 4 Sonar-example 4 0.6666667 0.7309900 0.6266667 0.7109039 0.8000000
#> 5 Sonar-example 5 0.6883117 0.7335203 0.6753247 0.7685835 0.8051948
#> 6 Sonar-example 6 0.7792208 0.8367072 0.7272727 0.7864372 0.7922078
#> auc.rF
#> 1 0.9272727
#> 2 0.8828348
#> 3 0.9696078
#> 4 0.8884505
#> 5 0.8923562
#> 6 0.9460189
## Scatterplot matrix of the accuracy columns (acc.lda, acc.rpart, acc.rF)
GGally::ggpairs(perfr, c(3, 5, 7))
Further comments
- In the examples shown in this section we applied "raw" learning algorithms, but often things are more complicated. At the very least, many learners have hyperparameters that need to be tuned to get sensible results. Reliable performance estimates can be obtained by nested resampling, i.e., by doing the tuning in an inner resampling loop while estimating the performance in an outer loop. Moreover, you might want to combine learners with pre-processing steps like imputation, scaling, outlier removal, dimensionality reduction or feature selection and so on. All this can be easily done by using mlr's wrapper functionality. The general principle is explained in the section about wrapped learners in the Advanced part of this tutorial. There are also several sections devoted to common pre-processing steps.
- Benchmark experiments can very quickly become computationally demanding. mlr offers some possibilities for parallelization, as sketched below.
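A minimal sketch of the latter, assuming package parallelMap is installed and a machine with at least two cores, would start a parallel backend before calling benchmark:
## Start a local socket cluster with two workers; mlr then runs parts of the experiment in parallel
parallelMap::parallelStartSocket(2)
res = benchmark(lrns, tasks, rdesc, measures = list(acc, auc))
parallelMap::parallelStop()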