Evaluating Learner Performance

The quality of the predictions of a model in mlr can be assessed with respect to a number of different performance measures. In order to calculate the performance measures, call performance on the object returned by predict and specify the desired performance measures.
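
A minimal sketch of this workflow looks as follows (output omitted; lrn and task stand for any Learner and Task created as shown in the previous sections):

mod = train(lrn, task)
pred = predict(mod, task = task)
## e.g., the mean misclassification error for a classification task;
## if measures is not given, the default measure of the task type is used
performance(pred, measures = mmce)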

Available performance measures

mlr provides a large number of performance measures for all types of learning problems. Typical performance measures for classification are the mean misclassification error (mmce), accuracy (acc) or measures based on ROC analysis. For regression the mean of squared errors (mse) or the mean of absolute errors (mae) are usually considered. For clustering tasks measures such as the Dunn index (dunn) are provided, for survival predictions the Concordance Index (cindex) is supported, and for cost-sensitive predictions the misclassification penalty (mcp), among others, is available. It is also possible to access the time to train the learner (timetrain), the time to compute the prediction (timepredict) and their sum (timeboth) as performance measures.

To see which performance measures are implemented, have a look at the table of performance measures and the measures documentation page.

If you want to implement an additional measure or include a measure with non-standard misclassification costs, see the section on creating custom measures.
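
As a rough sketch, a custom regression measure could be defined via makeMeasure along the following lines; the measure id my.mae and the chosen slots are purely illustrative, see the help page of makeMeasure and the section on creating custom measures for the authoritative interface.

## Illustrative sketch of a custom mean absolute error measure
my.mae = makeMeasure(
  id = "my.mae", name = "My mean absolute error",
  minimize = TRUE, best = 0, worst = Inf,
  properties = c("regr", "req.pred", "req.truth"),
  fun = function(task, model, pred, feats, extra.args) {
    mean(abs(getPredictionResponse(pred) - getPredictionTruth(pred)))
  }
)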

Listing measures

The properties and requirements of the individual measures are shown in the table of performance measures.

If you would like a list of available measures with certain properties or suitable for a certain learning Task, use the function listMeasures.

## Performance measures for classification with multiple classes
listMeasures("classif", properties = "classif.multi")
#> [1] "featperc"       "mmce"           "timeboth"       "acc"           
#> [5] "multiclass.auc" "ber"            "timepredict"    "timetrain"
## Performance measure suitable for the iris classification task
listMeasures(iris.task)
#> [1] "featperc"       "mmce"           "timeboth"       "acc"           
#> [5] "multiclass.auc" "ber"            "timepredict"    "timetrain"

Calculate performance measures

In the following example we fit a gradient boosting machine on a subset of the BostonHousing data set and calculate the mean squared error (mse) on the remaining observations.

n = getTaskSize(bh.task)
lrn = makeLearner("regr.gbm", n.trees = 1000)
mod = train(lrn, task = bh.task, subset = seq(1, n, 2))
pred = predict(mod, task = bh.task, subset = seq(2, n, 2))

# mse is the default measure for regression, we do not have to specify
# it here
performance(pred)
#>      mse 
#> 42.68414

The following code computes the median of squared errors (medse) instead.

performance(pred, measures = medse)
#>    medse 
#> 9.134965

Of course, we can also calculate multiple performance measures at once by simply passing a list of measures, which can also include your own measure.

Below we calculate the mean squared error (mse), the median of squared errors (medse) and the mean absolute error (mae).

performance(pred, measures = list(mse, medse, mae))
#>       mse     medse       mae 
#> 42.684141  9.134965  4.536750

For the other types of learning problems and measures, calculating the performance basically works in the same way.
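
For example, a classification analogue of the regression example above could look like this (a sketch using the iris task; output omitted):

n = getTaskSize(iris.task)
mod = train(makeLearner("classif.rpart"), iris.task, subset = seq(1, n, by = 2))
pred = predict(mod, task = iris.task, subset = seq(2, n, by = 2))
## mmce is the default measure for classification; acc is requested explicitly
performance(pred, measures = list(mmce, acc))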

Requirements of performance measures

Note that in order to calculate some performance measures you have to pass the Task or the fitted model in addition to the Prediction.

For example, in order to assess the time needed for training (timetrain), the fitted model has to be passed.

performance(pred, measures = timetrain, model = mod)
#> timetrain 
#>     0.158

For many performance measures in cluster analysis the Task is required.

lrn = makeLearner("cluster.kmeans", centers = 3)
mod = train(lrn, mtcars.task)
pred = predict(mod, task = mtcars.task)

## Calculate the Dunn index
performance(pred, measures = dunn, task = mtcars.task)
#>      dunn 
#> 0.1462919

Moreover, some measures require a certain type of prediction. For example, in binary classification, calculating the AUC (auc), the area under the ROC (receiver operating characteristic) curve, requires that posterior probabilities are predicted. For more information on ROC analysis, see the section on ROC analysis.

lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, task = sonar.task)
pred = predict(mod, task = sonar.task)

performance(pred, measures = auc)
#>       auc 
#> 0.9224018

Also bear in mind that many of the performance measures that are available for classification, e.g., the false positive rate (fpr), are only suitable for binary problems.
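
A quick way to check this (just a sketch) is to test whether a measure is listed for a multiclass task:

## fpr is not among the measures returned for the multiclass iris task,
## so the result is FALSE
"fpr" %in% listMeasures(iris.task)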

Access a performance measure

Performance measures in mlr are objects of class Measure. If you are interested in the properties or requirements of a single measure you can access it directly. See the help page of Measure for information on the individual slots.

## Mean misclassification error
str(mmce)
#> List of 10
#>  $ id        : chr "mmce"
#>  $ minimize  : logi TRUE
#>  $ properties: chr [1:4] "classif" "classif.multi" "req.pred" "req.truth"
#>  $ fun       :function (task, model, pred, feats, extra.args)  
#>  $ extra.args: list()
#>  $ best      : num 0
#>  $ worst     : num 1
#>  $ name      : chr "Mean misclassification error"
#>  $ note      : chr ""
#>  $ aggr      :List of 2
#>   ..$ id : chr "test.mean"
#>   ..$ fun:function (task, perf.test, perf.train, measure, group, pred)  
#>   ..- attr(*, "class")= chr "Aggregation"
#>  - attr(*, "class")= chr "Measure"

Binary classification: Plot performance versus threshold

As you may recall (see the previous section on making predictions), in binary classification we can adjust the threshold used to map probabilities to class labels. Helpful in this regard are the functions generateThreshVsPerfData and plotThreshVsPerf, which generate and plot, respectively, the learner performance versus the threshold.

For more performance plots and automatic threshold tuning see here.

In the following example we consider the Sonar data set and plot the false positive rate (fpr), the false negative rate (fnr) as well as the misclassification rate (mmce) for all possible threshold values.

lrn = makeLearner("classif.lda", predict.type = "prob")
n = getTaskSize(sonar.task)
mod = train(lrn, task = sonar.task, subset = seq(1, n, by = 2))
pred = predict(mod, task = sonar.task, subset = seq(2, n, by = 2))

## Performance for the default threshold 0.5
performance(pred, measures = list(fpr, fnr, mmce))
#>       fpr       fnr      mmce 
#> 0.2500000 0.3035714 0.2788462
## Plot false negative and positive rates as well as the error rate versus the threshold
d = generateThreshVsPerfData(pred, measures = list(fpr, fnr, mmce))
plotThreshVsPerf(d)

(Plot: false positive rate, false negative rate and mean misclassification error versus the threshold)
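
If the plot suggests a different threshold, the prediction can be re-thresholded with setThreshold (see the section on making predictions) and the performance recomputed. A sketch with an arbitrarily chosen threshold of 0.3, output omitted:

## 0.3 is an arbitrary example value, not a recommendation
pred2 = setThreshold(pred, 0.3)
performance(pred2, measures = list(fpr, fnr, mmce))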

There is an experimental ggvis plotting function, plotThreshVsPerfGGVIS, which performs similarly to plotThreshVsPerf. Instead of creating faceted subplots to visualize multiple learners and/or multiple measures, it maps one of them to an interactive sidebar that selects what to display.

plotThreshVsPerfGGVIS(d)