Benchmark Experiments
In a benchmark experiment different learning methods are applied to one or several data sets with the aim of comparing and ranking the algorithms with respect to one or more performance measures.
In mlr a benchmark experiment can be conducted by calling function benchmark on a list of Learners and a list of Tasks. benchmark basically executes resample for each combination of Learner and Task. You can specify an individual resampling strategy for each Task and select one or multiple performance measures to be calculated.
Conducting a benchmark experiment
We start with a small example. Two learners, linear discriminant analysis and
classification trees, are applied to one classification problem (sonar.task).
As resampling strategy we choose "Holdout". The performance is thus calculated on a single randomly sampled test data set.
In the example below we create a resample description (ResampleDesc), which is automatically instantiated by benchmark. The instantiation is done only once per Task, i.e., the same training and test sets are used for all learners. It is also possible to directly pass a ResampleInstance.
If you would like to use a fixed test data set instead of a randomly selected one, you can create a suitable ResampleInstance through function makeFixedHoldoutInstance.
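For example, a minimal sketch for the Sonar data (208 observations), where the index split below is arbitrary and only chosen for illustration:
## Use observations 1-158 for training and 159-208 as the fixed test set
rin = makeFixedHoldoutInstance(train.inds = 1:158, test.inds = 159:208, size = 208)
## rin can then be passed to benchmark in place of a resample description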
## Two learners to be compared
lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
## Choose the resampling strategy
rdesc = makeResampleDesc("Holdout")
## Conduct the benchmark experiment
bmr = benchmark(lrns, sonar.task, rdesc)
#> Task: Sonar-example, Learner: classif.lda
#> [Resample] holdout iter: 1
#> [Resample] Result: mmce.test.mean= 0.3
#> Task: Sonar-example, Learner: classif.rpart
#> [Resample] holdout iter: 1
#> [Resample] Result: mmce.test.mean=0.286
bmr
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.3000000
#> 2 Sonar-example classif.rpart 0.2857143
In the printed table every row corresponds to one pair of Task and Learner. The entries show the mean misclassification error (mmce), the default performance measure for classification, on the test data set.
The result bmr is an object of class BenchmarkResult. Basically, it contains a list of lists of ResampleResult objects, first ordered by Task and then by Learner.
Accessing the benchmark result
mlr provides several accessor functions, named getBMR<WhatToExtract>, that allow you to retrieve information for further analyses, for example the performances or predictions of the learning algorithms under consideration.
Let's have a look at the benchmark result above. getBMRPerformances returns individual performances in resampling runs, while getBMRAggrPerformances gives the aggregated values.
getBMRPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> iter mmce
#> 1 1 0.3
#>
#> $`Sonar-example`$classif.rpart
#> iter mmce
#> 1 1 0.2857143
getBMRAggrPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> mmce.test.mean
#> 0.3
#>
#> $`Sonar-example`$classif.rpart
#> mmce.test.mean
#> 0.2857143
Since we used holdout as resampling strategy, individual and aggregated performance values coincide.
Often it is more convenient to work with data.frames. You can easily convert the result structure by setting as.df = TRUE.
getBMRPerformances(bmr, as.df = TRUE)
#> task.id learner.id iter mmce
#> 1 Sonar-example classif.lda 1 0.3000000
#> 2 Sonar-example classif.rpart 1 0.2857143
getBMRAggrPerformances(bmr, as.df = TRUE)
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.3000000
#> 2 Sonar-example classif.rpart 0.2857143
Function getBMRPredictions returns the predictions. By default, you get a list of lists of ResamplePrediction objects. In most cases you might prefer the data.frame version.
getBMRPredictions(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold:
#> time (mean): 0.01
#> id truth response iter set
#> 180 180 M M 1 test
#> 100 100 M R 1 test
#> 53 53 R M 1 test
#> 89 89 R R 1 test
#> 92 92 R M 1 test
#> 11 11 R R 1 test
#>
#> $`Sonar-example`$classif.rpart
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold:
#> time (mean): 0.02
#> id truth response iter set
#> 180 180 M M 1 test
#> 100 100 M M 1 test
#> 53 53 R R 1 test
#> 89 89 R M 1 test
#> 92 92 R M 1 test
#> 11 11 R R 1 test
head(getBMRPredictions(bmr, as.df = TRUE))
#> task.id learner.id id truth response iter set
#> 1 Sonar-example classif.lda 180 M M 1 test
#> 2 Sonar-example classif.lda 100 M R 1 test
#> 3 Sonar-example classif.lda 53 R M 1 test
#> 4 Sonar-example classif.lda 89 R R 1 test
#> 5 Sonar-example classif.lda 92 R M 1 test
#> 6 Sonar-example classif.lda 11 R R 1 test
It is also easily possible to access results for certain learners or tasks via their IDs. For this purpose many "getter" functions have a learner.ids and a task.ids argument.
head(getBMRPredictions(bmr, learner.ids = "classif.rpart", as.df = TRUE))
#> task.id learner.id id truth response iter set
#> 180 Sonar-example classif.rpart 180 M M 1 test
#> 100 Sonar-example classif.rpart 100 M M 1 test
#> 53 Sonar-example classif.rpart 53 R R 1 test
#> 89 Sonar-example classif.rpart 89 R M 1 test
#> 92 Sonar-example classif.rpart 92 R M 1 test
#> 11 Sonar-example classif.rpart 11 R R 1 test
If you don't like the default IDs, you can set the IDs of learners and tasks via the id option of makeLearner and make*Task. Moreover, you can conveniently change the ID of a Learner via function setId.
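For instance (the alternative IDs below are arbitrary):
## Set the ID when constructing the learner ...
lrn = makeLearner("classif.lda", id = "LDA")
## ... or change it afterwards
lrn = setId(lrn, "lda2")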
The IDs of all Learners, Tasks and Measures in a benchmark experiment can be retrieved as follows:
getBMRTaskIds(bmr)
#> [1] "Sonar-example"
getBMRLearnerIds(bmr)
#> [1] "classif.lda" "classif.rpart"
getBMRMeasureIds(bmr)
#> [1] "mmce"
Moreover, you can extract the employed Learners and Measures.
getBMRLearners(bmr)
#> $classif.lda
#> Learner classif.lda from package MASS
#> Type: classif
#> Name: Linear Discriminant Analysis; Short name: lda
#> Class: classif.lda
#> Properties: twoclass,multiclass,numerics,factors,prob
#> Predict-Type: response
#> Hyperparameters:
#>
#>
#> $classif.rpart
#> Learner classif.rpart from package rpart
#> Type: classif
#> Name: Decision Tree; Short name: rpart
#> Class: classif.rpart
#> Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
#> Predict-Type: response
#> Hyperparameters: xval=0
getBMRMeasures(bmr)
#> [[1]]
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: test.mean
#> Note:
Benchmark analysis and visualization
mlr offers several possibilities to analyse the results of a benchmark experiment. This includes visualization, ranking of learning algorithms and hypothesis tests to assess performance differences between learners.
In order to demonstrate the functionality we conduct a slightly larger benchmark experiment with three learning algorithms that are applied to five classification tasks.
Example: Comparing lda, rpart and random forest
We consider linear discriminant analysis (lda), classification trees (rpart), and random forests (randomForest). Since the default learner IDs are a little long, we choose shorter names in the R code below.
We use five classification tasks. Three are already provided by mlr, two more data sets are taken from package mlbench and converted to Tasks by function convertMLBenchObjToTask.
For all tasks 10-fold cross-validation is chosen as resampling strategy. This is achieved by passing a single resample description to benchmark, which is then instantiated automatically once for each Task. This way, the same instance is used for all learners applied to a single task.
It is also possible to choose a different resampling strategy for each Task by passing a list of the same length as the number of tasks, which may contain both resample descriptions and resample instances.
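For illustration, a minimal sketch with two tasks that mixes a resample description with a pre-instantiated ResampleInstance (the object names are arbitrary):
## The i-th list entry is used for the i-th task
rdescs = list(
  makeResampleDesc("CV", iters = 10),
  makeResampleInstance(makeResampleDesc("Holdout"), iris.task)
)
## bmr.mixed = benchmark(lrns, list(sonar.task, iris.task), rdescs)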
We use the mean misclassification error mmce as primary performance measure, but also calculate the balanced error rate (ber) and the training time (timetrain).
## Create a list of learners
lrns = list(
makeLearner("classif.lda", id = "lda"),
makeLearner("classif.rpart", id = "rpart"),
makeLearner("classif.randomForest", id = "randomForest")
)
## Get additional Tasks from package mlbench
ring.task = convertMLBenchObjToTask("mlbench.ringnorm", n = 600)
wave.task = convertMLBenchObjToTask("mlbench.waveform", n = 600)
tasks = list(iris.task, sonar.task, pid.task, ring.task, wave.task)
rdesc = makeResampleDesc("CV", iters = 10)
meas = list(mmce, ber, timetrain)
bmr = benchmark(lrns, tasks, rdesc, meas, show.info = FALSE)
bmr
#> task.id learner.id mmce.test.mean ber.test.mean
#> 1 iris-example lda 0.02000000 0.02222222
#> 2 iris-example rpart 0.08000000 0.07555556
#> 3 iris-example randomForest 0.05333333 0.05250000
#> 4 mlbench.ringnorm lda 0.35000000 0.34605671
#> 5 mlbench.ringnorm rpart 0.17333333 0.17313632
#> 6 mlbench.ringnorm randomForest 0.05833333 0.05806121
#> 7 mlbench.waveform lda 0.19000000 0.18257244
#> 8 mlbench.waveform rpart 0.28833333 0.28765247
#> 9 mlbench.waveform randomForest 0.16500000 0.16306057
#> 10 PimaIndiansDiabetes-example lda 0.22778537 0.27148893
#> 11 PimaIndiansDiabetes-example rpart 0.25133288 0.28967870
#> 12 PimaIndiansDiabetes-example randomForest 0.23685919 0.27543146
#> 13 Sonar-example lda 0.24619048 0.23986694
#> 14 Sonar-example rpart 0.30785714 0.31153361
#> 15 Sonar-example randomForest 0.17785714 0.17442696
#> timetrain.test.mean
#> 1 0.0056
#> 2 0.0070
#> 3 0.1110
#> 4 0.0179
#> 5 0.0214
#> 6 0.6859
#> 7 0.0195
#> 8 0.0206
#> 9 0.7072
#> 10 0.0090
#> 11 0.0115
#> 12 0.6432
#> 13 0.0665
#> 14 0.0252
#> 15 0.4622
From the aggregated performance values we can see that linear discriminant analysis performs well on the iris and PimaIndiansDiabetes examples, while the random forest seems superior on all other tasks. Training takes longer for the random forest than for the other learners.
In order to draw any conclusions from the average performances, at least their variability has to be taken into account or, preferably, the whole distribution of performance values across resampling iterations.
The individual performances on the 10 folds for every task, learner, and measure are retrieved below.
perf = getBMRPerformances(bmr, as.df = TRUE)
head(perf)
#> task.id learner.id iter mmce ber timetrain
#> 1 iris-example lda 1 0.0000000 0.0000000 0.006
#> 2 iris-example lda 2 0.1333333 0.1666667 0.006
#> 3 iris-example lda 3 0.0000000 0.0000000 0.005
#> 4 iris-example lda 4 0.0000000 0.0000000 0.005
#> 5 iris-example lda 5 0.0000000 0.0000000 0.005
#> 6 iris-example lda 6 0.0000000 0.0000000 0.009
A closer look at the result reveals that the random forest outperforms the classification tree in every instance, while linear discriminant analysis performs better than rpart most of the time. Additionally, lda sometimes even beats the random forest. With increasing size of such benchmark experiments, these tables become hard to comprehend.
mlr features some plotting functions to visualize results of benchmark experiments that you might find useful. Moreover, mlr offers statistical hypothesis tests to assess performance differences between learners.
Integrated plots
Plots are produced using ggplot2, as this package enables further customization, such as renaming plot elements or changing colors.
Visualizing performances
plotBMRBoxplots creates box or violin plots for a BenchmarkResult which show the distribution of performance values across resampling iterations for one performance measure and for all learners and tasks (and thus visualize the output of getBMRPerformances).
Below are both variants, box plots and violin plots. The first plot shows the mmce and the second plot the balanced error rate (ber). The panels are arranged in two rows. In the second plot we color the boxes according to the learners by adding additional aesthetics to make them better distinguishable.
plotBMRBoxplots(bmr, measure = mmce) +
facet_wrap(~ task.id, nrow = 2)
plotBMRBoxplots(bmr, measure = ber, style = "violin") +
aes(color = learner.id) +
facet_wrap(~ task.id, nrow = 2)
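Since these functions return ggplot objects, further ggplot2 modifiers can be chained on in the usual way, for example to relabel the y-axis of the first plot (a minimal sketch):
plotBMRBoxplots(bmr, measure = mmce) +
  facet_wrap(~ task.id, nrow = 2) +
  ylab("Mean misclassification error")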
Visualizing aggregated performances
The aggregated performance values (resulting from getBMRAggrPerformances) can be visualized by function plotBMRSummary. This plot draws one line for each task on which the aggregated values of one performance measure for all learners are displayed. By default, the first measure in the list of Measures passed to benchmark is used, in our example mmce. Moreover, a small vertical jitter is added to prevent overplotting.
plotBMRSummary(bmr)
Calculating and visualizing ranks
In addition to the absolute performance, the relative performance, i.e., the ranking of the learners, is usually of interest and might provide valuable additional insight.
Function convertBMRToRankMatrix calculates ranks based on aggregated learner performances of one measure. We choose the mean misclassification error (mmce). The rank structure can be visualized by plotBMRRanksAsBarChart.
m = convertBMRToRankMatrix(bmr, mmce)
m
#> iris-example mlbench.ringnorm mlbench.waveform
#> lda 1 3 2
#> rpart 3 2 3
#> randomForest 2 1 1
#> PimaIndiansDiabetes-example Sonar-example
#> lda 1 2
#> rpart 3 3
#> randomForest 2 1
Methods with the best performance, i.e., the lowest mmce, are assigned the lowest rank. Linear discriminant analysis is best for the iris and PimaIndiansDiabetes examples, while the random forest shows the best results on the remaining tasks.
plotBMRRanksAsBarChart with option pos = "tile" shows a corresponding heat map. The ranks are displayed on the x-axis and the learners are color-coded.
plotBMRRanksAsBarChart(bmr, pos = "tile")
Alternatively, you can draw stacked bar charts (the default) or bar charts with juxtaposed bars (pos = "dodge") that are better suited to compare the frequencies of learners within and across ranks.
plotBMRRanksAsBarChart(bmr)
plotBMRRanksAsBarChart(bmr, pos = "dodge")
Comparing learners using hypothesis tests
Many researchers feel the need to display an algorithm's superiority by employing some sort of hypothesis testing. As non-parametric tests seem better suited for such benchmark results, the tests provided in mlr are the overall Friedman test and the Friedman-Nemenyi post hoc test.
While the overall Friedman test, based on friedman.test from the stats package, tests whether there is a significant difference among the employed learners at all, the post hoc Friedman-Nemenyi test tests for significant differences between all pairs of learners. Non-parametric tests often have less power than their parametric counterparts, but fewer assumptions about the underlying distributions have to be made. In practice this means that many data sets are often needed in order to show significant differences at reasonable significance levels.
In our example, we want to compare the three learners on the selected data sets. First we might want to test the hypothesis that there is a difference between the learners.
friedmanTestBMR(bmr)
#>
#> Friedman rank sum test
#>
#> data: x and learner.id and task.id
#> Friedman chi-squared = 5.2, df = 2, p-value = 0.07427
In order to keep the computation time for this tutorial small, the Learners are only evaluated on five tasks. This also means that we have to operate at a relatively large significance level of α = 0.1. As we can reject the null hypothesis of the Friedman test at this significance level, we might now want to test where exactly these differences lie.
friedmanPostHocTestBMR(bmr, p.value = 0.1)
#>
#> Pairwise comparisons using Nemenyi multiple comparison test
#> with q approximation for unreplicated blocked data
#>
#> data: x and learner.id and task.id
#>
#> lda rpart
#> rpart 0.254 -
#> randomForest 0.802 0.069
#>
#> P value adjustment method: none
At this level of significance, we can reject the null hypothesis that there is no performance difference between the decision tree (rpart) and the random forest.
Critical differences diagram
In order to visualize differently performing learners, a critical differences diagram can be plotted, using either the Nemenyi test (test = "nemenyi") or the Bonferroni-Dunn test (test = "bd"). The mean rank of learners is displayed on the x-axis.
- Choosing test = "nemenyi" compares all pairs of Learners to each other, so the output is groups of learners that are not significantly different. The diagram connects all groups of learners whose mean ranks do not differ by more than the critical difference. Learners that are not connected by a bar are significantly different, and the learner(s) with the lower mean rank can be considered "better" at the chosen significance level.
- Choosing test = "bd" performs a pairwise comparison with a baseline. An interval which extends by the given critical difference in both directions is drawn around the Learner chosen as baseline, and only comparisons with the baseline are possible. All learners within the interval are not significantly different, while the baseline can be considered better or worse than a learner outside of the interval.
The critical difference $CD$ is calculated as $CD = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}$, where $N$ denotes the number of tasks, $k$ is the number of learners, and $q_{\alpha}$ comes from the studentized range statistic divided by $\sqrt{2}$. For details see Demsar (2006).
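To make the formula concrete, here is a minimal sketch that computes the Nemenyi critical difference by hand for this experiment (k = 3 learners, N = 5 tasks, α = 0.1), using the studentized range quantile from base R:
## Nemenyi critical difference computed by hand (sketch)
k = 3; N = 5; alpha = 0.1
q.alpha = qtukey(1 - alpha, nmeans = k, df = Inf) / sqrt(2)
cd = q.alpha * sqrt(k * (k + 1) / (6 * N))
cd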
Function generateCritDifferencesData does all necessary calculations while function plotCritDifferences draws the plot. See the tutorial page about visualization for details on data generation and plotting functions.
## Nemenyi test
g = generateCritDifferencesData(bmr, p.value = 0.1, test = "nemenyi")
plotCritDifferences(g) + coord_cartesian(xlim = c(-1,5), ylim = c(0,2))
## Bonferroni-Dunn test
g = generateCritDifferencesData(bmr, p.value = 0.1, test = "bd", baseline = "randomForest")
plotCritDifferences(g) + coord_cartesian(xlim = c(-1,5), ylim = c(0,2))
Custom plots
You can easily generate your own visualizations by customizing the ggplot objects returned by the plots above, retrieving the data from the ggplot objects and using it as a basis for your own plots, or relying on the data.frames returned by getBMRPerformances or getBMRAggrPerformances. Here are some examples.
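For instance, the data underlying a plot can be pulled out of the returned ggplot object and reused (a minimal sketch):
p = plotBMRBoxplots(bmr, measure = mmce)
head(p$data)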
Instead of boxplots (as in plotBMRBoxplots) we could create density plots to show the performance values resulting from individual resampling iterations.
perf = getBMRPerformances(bmr, as.df = TRUE)
## Density plots for two tasks
qplot(mmce, colour = learner.id, facets = . ~ task.id,
data = perf[perf$task.id %in% c("iris-example", "Sonar-example"),], geom = "density")
In order to plot multiple performance measures in parallel, perf is reshaped to long format. Below we generate grouped boxplots showing the error rate (mmce) and the training time (timetrain).
## Compare mmce and timetrain
df = reshape2::melt(perf, id.vars = c("task.id", "learner.id", "iter"))
df = df[df$variable != "ber",]
head(df)
#> task.id learner.id iter variable value
#> 1 iris-example lda 1 mmce 0.0000000
#> 2 iris-example lda 2 mmce 0.1333333
#> 3 iris-example lda 3 mmce 0.0000000
#> 4 iris-example lda 4 mmce 0.0000000
#> 5 iris-example lda 5 mmce 0.0000000
#> 6 iris-example lda 6 mmce 0.0000000
qplot(variable, value, data = df, colour = learner.id, geom = "boxplot",
xlab = "measure", ylab = "performance") +
facet_wrap(~ task.id, nrow = 2)
It might also be useful to assess if learner performances in single resampling iterations, i.e., in one fold, are related. This might help to gain further insight, for example by having a closer look at train and test sets from iterations where one learner performs exceptionally well while another one is fairly bad. Moreover, this might be useful for the construction of ensembles of learning algorithms. Below, function ggpairs from package GGally is used to generate a scatterplot matrix of mean misclassification errors (mmce) on the Sonar data set.
perf = getBMRPerformances(bmr, task.id = "Sonar-example", as.df = TRUE)
df = reshape2::melt(perf, id.vars = c("task.id", "learner.id", "iter"))
df = df[df$variable == "mmce",]
df = reshape2::dcast(df, task.id + iter ~ variable + learner.id)
head(df)
#> task.id iter mmce_lda mmce_rpart mmce_randomForest
#> 1 Sonar-example 1 0.2857143 0.2857143 0.14285714
#> 2 Sonar-example 2 0.2380952 0.2380952 0.23809524
#> 3 Sonar-example 3 0.3333333 0.2857143 0.28571429
#> 4 Sonar-example 4 0.2380952 0.3333333 0.04761905
#> 5 Sonar-example 5 0.1428571 0.2857143 0.19047619
#> 6 Sonar-example 6 0.4000000 0.4500000 0.25000000
GGally::ggpairs(df, 3:5)
Further comments
- Note that for supervised classification mlr offers some more plots that operate on BenchmarkResult objects and allow you to compare the performance of learning algorithms. See for example the tutorial page on ROC curves and functions generateROCRCurvesData, plotViperCharts, and generateThreshVsPerfData as well as the page about classifier calibration and function generateCalibrationData.
- In the examples shown in this section we applied "raw" learning algorithms, but often things are more complicated. At the very least, many learners have hyperparameters that need to be tuned to get sensible results. Reliable performance estimates can be obtained by nested resampling, i.e., by doing the tuning in an inner resampling loop while estimating the performance in an outer loop. Moreover, you might want to combine learners with pre-processing steps like imputation, scaling, outlier removal, dimensionality reduction or feature selection and so on. All this can be easily done using mlr's wrapper functionality. The general principle is explained in the section about wrapped learners in the Advanced part of this tutorial. There are also several sections devoted to common pre-processing steps.
- Benchmark experiments can very quickly become computationally demanding. mlr offers some possibilities for parallelization; a minimal sketch is given below.
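For example, a minimal sketch that runs the resampling iterations of the benchmark above in parallel on four CPUs with the parallelMap package (the parallelization level and number of CPUs are just illustrative choices):
library(parallelMap)
## Parallelize the resampling loop over 4 CPUs
parallelStartSocket(4, level = "mlr.resample")
bmr.par = benchmark(lrns, tasks, rdesc, meas, show.info = FALSE)
parallelStop()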