Benchmark Experiments

In a benchmark experiment different learning methods are applied to one or several data sets with the aim of comparing and ranking the algorithms with respect to one or more performance measures.

In mlr a benchmark experiment can be conducted by calling function benchmark on a list of Learners and a list of Tasks. benchmark essentially executes resample for each combination of Learner and Task. You can specify an individual resampling strategy for each Task and select one or more performance measures to be calculated.

Example: One task, two learners, prediction on a single test set

We start with a small example. Two learners, linear discriminant analysis and classification trees, are applied to one classification problem (sonar.task). As resampling strategy we choose "Holdout". The performance is thus calculated on a single randomly sampled test data set.

In the example below we create a resample description (ResampleDesc), which is automatically instantiated by benchmark. The instantiation is done only once, that is, the same resample instance (ResampleInstance) is used for all learners applied to the Task. It's also possible to directly pass a ResampleInstance.

If you would like to use a fixed test data set instead of a randomly selected one, you can create a suitable ResampleInstance through function makeFixedHoldoutInstance.
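As a minimal sketch (the train and test indices below are arbitrary placeholders; the Sonar data contains 208 observations), such an instance can be created and then passed to benchmark in place of a ResampleDesc:

## Fixed holdout split with placeholder indices
rin = makeFixedHoldoutInstance(train.inds = 1:150, test.inds = 151:208, size = 208)
## Pass the instance instead of a resample description (lrns as defined below)
# res = benchmark(lrns, sonar.task, rin)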

## Two learners to be compared
lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))

## Choose the resampling strategy
rdesc = makeResampleDesc("Holdout")

## Conduct the benchmark experiment
res = benchmark(lrns, sonar.task, rdesc)
#> Task: Sonar-example, Learner: classif.lda
#> [Resample] holdout iter: 1
#> [Resample] Result: mmce.test.mean= 0.3
#> Task: Sonar-example, Learner: classif.rpart
#> [Resample] holdout iter: 1
#> [Resample] Result: mmce.test.mean=0.286

res
#>         task.id    learner.id mmce.test.mean
#> 1 Sonar-example   classif.lda      0.3000000
#> 2 Sonar-example classif.rpart      0.2857143

In the printed table every row corresponds to one pair of Task and Learner. The entries show the mean misclassification error (mmce), the default performance measure for classification, on the test data set.

The result res is an object of class BenchmarkResult. Basically, this is a list of lists of ResampleResult objects, first ordered by Task and then by Learner.

mlr provides several accessor functions, named getBMR<what_to_extract>, that allow you to retrieve information for further analyses. This includes, for example, the performances or predictions of the learning algorithms under consideration.

Let's have a look at the benchmark result above. getBMRPerformances returns individual performances in resampling runs, while getBMRAggrPerformances gives the aggregated values.

getBMRPerformances(res)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#>   iter mmce
#> 1    1  0.3
#> 
#> $`Sonar-example`$classif.rpart
#>   iter      mmce
#> 1    1 0.2857143

getBMRAggrPerformances(res)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> mmce.test.mean 
#>            0.3 
#> 
#> $`Sonar-example`$classif.rpart
#> mmce.test.mean 
#>      0.2857143

Since we used holdout as resampling strategy, individual and aggregated performances are the same.

Often it is more convenient to work with data.frames. You can easily convert the result structure by setting as.df = TRUE.

getBMRPerformances(res, as.df = TRUE)
#>         task.id    learner.id iter      mmce
#> 1 Sonar-example   classif.lda    1 0.3000000
#> 2 Sonar-example classif.rpart    1 0.2857143

getBMRAggrPerformances(res, as.df = TRUE)
#>         task.id    learner.id mmce.test.mean
#> 1 Sonar-example   classif.lda      0.3000000
#> 2 Sonar-example classif.rpart      0.2857143

Function getBMRPredictions returns the predictions. By default, you get a list of lists of ResamplePrediction objects. In most cases you might prefer the data.frame version.

getBMRPredictions(res)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold: 
#> time (mean): 0.01
#>      id truth response iter  set
#> 180 180     M        M    1 test
#> 100 100     M        R    1 test
#> 53   53     R        M    1 test
#> 89   89     R        R    1 test
#> 92   92     R        M    1 test
#> 11   11     R        R    1 test
#> 
#> $`Sonar-example`$classif.rpart
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold: 
#> time (mean): 0.01
#>      id truth response iter  set
#> 180 180     M        M    1 test
#> 100 100     M        M    1 test
#> 53   53     R        R    1 test
#> 89   89     R        M    1 test
#> 92   92     R        M    1 test
#> 11   11     R        R    1 test

head(getBMRPredictions(res, as.df = TRUE))
#>         task.id  learner.id  id truth response iter  set
#> 1 Sonar-example classif.lda 180     M        M    1 test
#> 2 Sonar-example classif.lda 100     M        R    1 test
#> 3 Sonar-example classif.lda  53     R        M    1 test
#> 4 Sonar-example classif.lda  89     R        R    1 test
#> 5 Sonar-example classif.lda  92     R        M    1 test
#> 6 Sonar-example classif.lda  11     R        R    1 test

You can also easily access the results for specific learners or tasks via their IDs. Nearly all "getter" functions have learner.ids and task.ids arguments.

head(getBMRPredictions(res, learner.ids = "classif.rpart", as.df = TRUE))
#>           task.id    learner.id  id truth response iter  set
#> 180 Sonar-example classif.rpart 180     M        M    1 test
#> 100 Sonar-example classif.rpart 100     M        M    1 test
#> 53  Sonar-example classif.rpart  53     R        R    1 test
#> 89  Sonar-example classif.rpart  89     R        M    1 test
#> 92  Sonar-example classif.rpart  92     R        M    1 test
#> 11  Sonar-example classif.rpart  11     R        R    1 test

As you might recall, you can use the id option of makeLearner or function setId to set the ID of a Learner, and the id option of make*Task to set Task IDs.
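
A small sketch (the IDs are chosen arbitrarily):

## Set the ID directly when creating the learner ...
lrn = makeLearner("classif.lda", id = "lda")
## ... or change it afterwards with setId
lrn = setId(lrn, "lda2")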

The IDs of all Learners and Tasks in a benchmark experiment can be retrieved as follows:

getBMRTaskIds(res)
#> [1] "Sonar-example"

getBMRLearnerIds(res)
#> [1] "classif.lda"   "classif.rpart"

Example: Two tasks, three learners, bootstrapping

Let's have a look at a larger benchmark experiment with two classification tasks (pid.task and sonar.task) and three learning algorithms. Since the default learner IDs are a little long, we choose shorter names.

For both tasks bootstrapping with 20 iterations is chosen as the resampling strategy. This is achieved by passing a single resample description to benchmark, which is then automatically instantiated once for each Task. Thus, the same instance is used for all learners applied to one task.

It's also possible to choose a different resampling strategy for each Task by passing a list of the same length as the number of Tasks, which can contain both ResampleDescs and ResampleInstances (see the sketch below).
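
A small sketch (not run; the concrete strategies are chosen arbitrarily, and the list is assumed to be matched to the Tasks by position):

## One resampling strategy per Task, in the same order as the tasks list below
rdescs = list(makeResampleDesc("Bootstrap", iters = 20),
  makeResampleDesc("CV", iters = 10))
# res = benchmark(lrns, tasks, rdescs, measures = list(acc, auc))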

In the example below the accuracy (acc) and the area under the ROC curve (auc) are calculated.

## Three learners to be compared
lrns = list(makeLearner("classif.lda", predict.type = "prob", id = "lda"),
  makeLearner("classif.rpart", predict.type = "prob", id = "rpart"),
  makeLearner("classif.randomForest", predict.type = "prob", id = "rF"))

## Two classification tasks
tasks = list(pid.task, sonar.task)

## Use bootstrapping for both tasks
rdesc = makeResampleDesc("Bootstrap", iters = 20)

## Conduct the benchmark experiment
res = benchmark(lrns, tasks, rdesc, measures = list(acc, auc), show.info = FALSE)

res
#>                       task.id learner.id acc.test.mean auc.test.mean
#> 1 PimaIndiansDiabetes-example        lda     0.7677489     0.8237077
#> 2 PimaIndiansDiabetes-example      rpart     0.7386807     0.7367115
#> 3 PimaIndiansDiabetes-example         rF     0.7600432     0.8166053
#> 4               Sonar-example        lda     0.7137996     0.7755366
#> 5               Sonar-example      rpart     0.6985412     0.7331386
#> 6               Sonar-example         rF     0.8156073     0.9184084

The entries in the printed table show the aggregated accuracy and AUC values. On the Pima Indians Diabetes data lda and the random forest perform nearly identically. On the Sonar data the random forest has the highest accuracy and AUC.

Instead of just comparing mean performance values it's generally preferable to have a look at the distribution of performance values obtained in individual resampling runs. The individual performances on the 20 bootstrap iterations for every task and learner are retrieved below.

perf = getBMRPerformances(res, as.df = TRUE)
head(perf)
#>                       task.id learner.id iter       acc       auc
#> 1 PimaIndiansDiabetes-example        lda    1 0.7065217 0.7644810
#> 2 PimaIndiansDiabetes-example        lda    2 0.7789116 0.8158948
#> 3 PimaIndiansDiabetes-example        lda    3 0.7977941 0.8517787
#> 4 PimaIndiansDiabetes-example        lda    4 0.7781818 0.8162754
#> 5 PimaIndiansDiabetes-example        lda    5 0.7403509 0.7966082
#> 6 PimaIndiansDiabetes-example        lda    6 0.7681661 0.8356029

As part of a first exploratory analysis you might want to create some plots, for example dot plots, boxplots, density plots, or histograms. Currently, mlr does not provide any plotting functionality for benchmark experiments, but based on the data.frame returned by getBMRPerformances some basic plots are easily created.

Shown below are boxplots of the accuracy (acc) and density plots of the AUC (auc), generated with function qplot from package ggplot2.

library(ggplot2)
qplot(y = acc, x = task.id, colour = learner.id, data = perf, geom = "boxplot")

qplot(auc, colour = learner.id, facets = . ~ task.id, data = perf, geom = "density")

In order to plot both performance measures in parallel, perf is reshaped to long format. Below we generate grouped boxplots and density plots for all tasks, learners, and measures.

perfm = reshape2::melt(perf, id.vars = c("task.id", "learner.id", "iter"), measure.vars = c("acc", "auc"))
head(perfm)
#>                       task.id learner.id iter variable     value
#> 1 PimaIndiansDiabetes-example        lda    1      acc 0.7065217
#> 2 PimaIndiansDiabetes-example        lda    2      acc 0.7789116
#> 3 PimaIndiansDiabetes-example        lda    3      acc 0.7977941
#> 4 PimaIndiansDiabetes-example        lda    4      acc 0.7781818
#> 5 PimaIndiansDiabetes-example        lda    5      acc 0.7403509
#> 6 PimaIndiansDiabetes-example        lda    6      acc 0.7681661

qplot(variable, value, data = perfm, colour = learner.id, facets = . ~ task.id, geom = "boxplot",
  xlab = "measure", ylab = "performance")

qplot(value, data = perfm, colour = learner.id, facets = variable ~ task.id, geom = "density",
  xlab = "performance")

It might also be useful to assess whether learner performances in individual resampling iterations, i.e., on the same bootstrap sample, are related. This might help to gain further insight, for example by having a closer look at bootstrap samples where one learner performs exceptionally well while another performs rather poorly. Moreover, this might be useful for constructing ensembles of learning algorithms. Below, function ggpairs from package GGally is used to generate a scatterplot matrix of the learner accuracies (acc) on the Sonar data set.

perf = getBMRPerformances(res, task.ids = "Sonar-example", as.df = TRUE)
perfr = reshape(perf, direction = "wide", v.names = c("acc", "auc"), timevar = "learner.id",
  idvar = c("task.id", "iter"))
head(perfr)
#>         task.id iter   acc.lda   auc.lda acc.rpart auc.rpart    acc.rF
#> 1 Sonar-example    1 0.7468354 0.7928571 0.6202532 0.5928571 0.7341772
#> 2 Sonar-example    2 0.7466667 0.7642450 0.7066667 0.6613248 0.7200000
#> 3 Sonar-example    3 0.7968750 0.8666667 0.7031250 0.7406863 0.9062500
#> 4 Sonar-example    4 0.6666667 0.7309900 0.6266667 0.7109039 0.8000000
#> 5 Sonar-example    5 0.6883117 0.7335203 0.6753247 0.7685835 0.8051948
#> 6 Sonar-example    6 0.7792208 0.8367072 0.7272727 0.7864372 0.7922078
#>      auc.rF
#> 1 0.9272727
#> 2 0.8828348
#> 3 0.9696078
#> 4 0.8884505
#> 5 0.8923562
#> 6 0.9460189

GGally::ggpairs(perfr, columns = c(3, 5, 7))

Further comments