Multilabel Classification
Multilabel classification is a classification problem in which several target labels can be assigned to each observation, rather than exactly one as in multiclass classification.
Two different approaches exist for multilabel classification. Problem transformation methods try to transform the multilabel classification into binary or multiclass classification problems. Algorithm adaptation methods adapt multiclass algorithms so they can be applied directly to the problem.
Creating a task
The first thing you have to do for multilabel classification in mlr is to get your data into the right format. You need a data.frame that contains the features plus one logical vector per label, indicating whether the label is present in the observation or not. After that you can create a MultilabelTask like a normal ClassifTask. Instead of one target name you have to specify a vector of targets which correspond to the names of the logical variables in the data.frame. In the following example we get the yeast data frame from the already existing yeast.task, extract the 14 label names and create the task again.
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
yeast.task
#> Supervised task: multi
#> Type: multilabel
#> Target: label1,label2,label3,label4,label5,label6,label7,label8,label9,label10,label11,label12,label13,label14
#> Observations: 2417
#> Features:
#> numerics  factors  ordered
#>      103        0        0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 14
#>  label1  label2  label3  label4  label5  label6  label7  label8  label9
#>     762    1038     983     862     722     597     428     480     178
#> label10 label11 label12 label13 label14
#>     253     289    1816    1799      34
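The yeast example starts from a task that already exists. For your own data the label columns usually have to be prepared by hand. The following is a minimal sketch, assuming a small invented data set (the names toy, temp, humid, sunny and windy are hypothetical and not part of mlr). The call to makeMultilabelTask is only run if mlr is installed.

```r
# Toy data: two numeric features plus two labels stored as 0/1 integers.
# All column names and values are invented for illustration.
toy = data.frame(
  temp  = c(21.5, 14.0, 18.2, 9.7, 25.1, 12.3),
  humid = c(0.40, 0.85, 0.60, 0.90, 0.35, 0.70),
  sunny = c(1L, 0L, 1L, 0L, 1L, 0L),
  windy = c(0L, 1L, 1L, 1L, 0L, 0L)
)
toy.labels = c("sunny", "windy")

# Convert every label column to a logical vector (TRUE = label present).
toy[toy.labels] = lapply(toy[toy.labels], as.logical)

# Create the task only if mlr is available.
if (requireNamespace("mlr", quietly = TRUE)) {
  toy.task = mlr::makeMultilabelTask(id = "toy", data = toy, target = toy.labels)
  print(toy.task)
}
```

All columns that are not named in target are treated as features, so anything that should not be used for learning has to be removed from the data.frame beforehand.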
Constructing a learner
Multilabel classification in mlr can currently be done in two ways:
- Use the binary relevance method. This problem transformation method converts the multilabel problem into one binary classification problem per label and applies a simple binary classifier to each of them. In mlr this is done by converting a binary learner into a wrapped binary relevance multilabel learner.
- Apply an algorithm adaptation method directly, which treats the whole problem with a single specialized algorithm.
Binary relevance method
To generate a wrapped multilabel learner, first create a binary (or multiclass) classification learner with makeLearner. Then apply the function makeMultilabelBinaryRelevanceWrapper to the learner to convert it into a binary relevance learner.
You can also generate a binary relevance learner directly by passing the name of the learner to makeMultilabelBinaryRelevanceWrapper, as the second call in the example shows.
multilabel.lrn = makeLearner("classif.rpart", predict.type = "prob")
multilabel.lrn = makeMultilabelBinaryRelevanceWrapper(multilabel.lrn)
multilabel.lrn
#> Learner multilabel.classif.rpart from package rpart
#> Type: multilabel
#> Name: ; Short name:
#> Class: MultilabelBinaryRelevanceWrapper
#> Properties: numerics,factors,ordered,missings,weights,prob,twoclass,multiclass
#> Predict-Type: prob
#> Hyperparameters: xval=0
multilabel.lrn1 = makeMultilabelBinaryRelevanceWrapper("classif.rpart")
multilabel.lrn1
#> Learner multilabel.classif.rpart from package rpart
#> Type: multilabel
#> Name: ; Short name:
#> Class: MultilabelBinaryRelevanceWrapper
#> Properties: numerics,factors,ordered,missings,weights,prob,twoclass,multiclass
#> Predict-Type: response
#> Hyperparameters: xval=0
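To make the idea behind binary relevance concrete, here is a rough base R sketch of the technique itself. It is not how mlr implements the wrapper. It assumes invented toy data and uses logistic regression via glm as a stand-in for the wrapped binary learner.

```r
# Toy data: two numeric features and two logical labels (all values invented).
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$labelA = toy$x1 + rnorm(n, sd = 0.5) > 0
toy$labelB = toy$x2 + rnorm(n, sd = 0.5) > 0
labels = c("labelA", "labelB")
features = setdiff(colnames(toy), labels)

# Binary relevance: fit one independent binary model per label,
# each using only the features (never the other labels).
models = lapply(labels, function(lab) {
  glm(reformulate(features, response = lab), data = toy, family = binomial())
})
names(models) = labels

# Predict each label separately and combine the results column-wise.
probs = sapply(models, function(m) predict(m, newdata = toy, type = "response"))
response = probs > 0.5
head(response)
```

Because every label gets its own model, dependencies between labels are ignored. That is the price for being able to reuse any ordinary binary classifier.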
Algorithm adaptation method
Currently the only available algorithm adaptation method in R is the Random Ferns multilabel algorithm in the rFerns package. You can create the learner for this algorithm in the same way as for multiclass classification problems.
multilabel.lrn2 = makeLearner("multilabel.rFerns")
multilabel.lrn2
#> Learner multilabel.rFerns from package rFerns
#> Type: multilabel
#> Name: Random ferns; Short name: rFerns
#> Class: multilabel.rFerns
#> Properties: numerics,factors,ordered
#> Predict-Type: response
#> Hyperparameters:
Train
You can train a model as usual with a multilabel learner and a multilabel task as input. You can also pass subset and weights arguments if the learner supports this.
mod = train(multilabel.lrn, yeast.task)
mod = train(multilabel.lrn, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))
mod
#> Model for learner.id=multilabel.classif.rpart; learner.class=MultilabelBinaryRelevanceWrapper
#> Trained on: task.id = multi; obs = 1500; features = 103
#> Hyperparameters: xval=0
mod2 = train(multilabel.lrn2, yeast.task, subset = 1:100)
mod2
#> Model for learner.id=multilabel.rFerns; learner.class=multilabel.rFerns
#> Trained on: task.id = multi; obs = 100; features = 103
#> Hyperparameters:
Predict
Prediction can be done as usual in mlr with predict, by passing a trained model and either the task to the task argument or some new data to the newdata argument. As always you can specify a subset of the data which should be predicted.
pred = predict(mod, task = yeast.task, subset = 1:10)
pred = predict(mod, newdata = yeast[1501:1600,])
names(as.data.frame(pred))
#> [1] "truth.label1" "truth.label2" "truth.label3"
#> [4] "truth.label4" "truth.label5" "truth.label6"
#> [7] "truth.label7" "truth.label8" "truth.label9"
#> [10] "truth.label10" "truth.label11" "truth.label12"
#> [13] "truth.label13" "truth.label14" "prob.label1"
#> [16] "prob.label2" "prob.label3" "prob.label4"
#> [19] "prob.label5" "prob.label6" "prob.label7"
#> [22] "prob.label8" "prob.label9" "prob.label10"
#> [25] "prob.label11" "prob.label12" "prob.label13"
#> [28] "prob.label14" "response.label1" "response.label2"
#> [31] "response.label3" "response.label4" "response.label5"
#> [34] "response.label6" "response.label7" "response.label8"
#> [37] "response.label9" "response.label10" "response.label11"
#> [40] "response.label12" "response.label13" "response.label14"
pred2 = predict(mod2, task = yeast.task)
names(as.data.frame(pred2))
#> [1] "id" "truth.label1" "truth.label2"
#> [4] "truth.label3" "truth.label4" "truth.label5"
#> [7] "truth.label6" "truth.label7" "truth.label8"
#> [10] "truth.label9" "truth.label10" "truth.label11"
#> [13] "truth.label12" "truth.label13" "truth.label14"
#> [16] "response.label1" "response.label2" "response.label3"
#> [19] "response.label4" "response.label5" "response.label6"
#> [22] "response.label7" "response.label8" "response.label9"
#> [25] "response.label10" "response.label11" "response.label12"
#> [28] "response.label13" "response.label14"
Depending on the chosen predict.type of the learner you get true and predicted values and possibly probabilities for each class label. These can be extracted by the usual accessor functions getPredictionTruth, getPredictionResponse and getPredictionProbabilities.
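If you work with the data frame returned by as.data.frame(pred) instead of the accessor functions, the three groups of columns can be separated by their prefixes. The following sketch assumes a small invented data frame that only mimics the truth.*, prob.* and response.* naming pattern shown above. The helper pickColumns is hypothetical and not part of mlr.

```r
# Stand-in for as.data.frame(pred): invented values, two labels only.
pred.df = data.frame(
  truth.label1 = c(TRUE, FALSE, TRUE),
  truth.label2 = c(FALSE, FALSE, TRUE),
  prob.label1 = c(0.9, 0.2, 0.4),
  prob.label2 = c(0.1, 0.6, 0.8),
  response.label1 = c(TRUE, FALSE, FALSE),
  response.label2 = c(FALSE, TRUE, TRUE)
)

# Hypothetical helper: keep all columns with the given prefix and strip it.
pickColumns = function(df, prefix) {
  pattern = paste0("^", prefix, "\\.")
  out = df[, grepl(pattern, colnames(df)), drop = FALSE]
  colnames(out) = sub(pattern, "", colnames(out))
  out
}

truth = pickColumns(pred.df, "truth")
probs = pickColumns(pred.df, "prob")
response = pickColumns(pred.df, "response")
colnames(probs)
```

With a real prediction object the accessor functions named above are the more direct route. The prefix approach is mainly useful once the prediction has been exported as a plain data frame.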
Performance
The performance of your prediction can be assessed via the function performance. You can specify via the measures argument which measure(s) to calculate. The default measure for multilabel classification is the Hamming loss (hamloss). All available measures for multilabel classification can be shown by listMeasures.
performance(pred)
#> hamloss
#> 0.2257143
performance(pred2, measures = list(hamloss, timepredict))
#> hamloss timepredict
#> 0.7033217 0.0940000
listMeasures("multilabel")
#> [1] "timepredict" "featperc" "timeboth" "timetrain" "hamloss"
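To see what the Hamming loss measures, it can be computed by hand. The sketch below assumes the usual definition, the fraction of observation-label pairs that are predicted incorrectly, and uses two small invented logical matrices in place of the truth and response of a real prediction.

```r
# Invented truth and response for 3 observations and 2 labels.
truth = cbind(
  label1 = c(TRUE, FALSE, TRUE),
  label2 = c(FALSE, TRUE, TRUE)
)
response = cbind(
  label1 = c(TRUE, TRUE, TRUE),
  label2 = c(FALSE, TRUE, FALSE)
)

# Hamming loss: share of all observation-label cells that disagree.
# Here 2 of the 6 cells differ (label1 in row 2, label2 in row 3).
hamloss.value = mean(truth != response)
hamloss.value
#> [1] 0.3333333
```

A value of 0 means every label of every observation was predicted correctly, while 1 means every single label was wrong.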
Resampling
To evaluate the overall performance of the learning algorithm you can use resampling. As usual you have to define a resampling strategy, either via makeResampleDesc or makeResampleInstance. After that you can run the resample function. Below, the default measure, the Hamming loss, is calculated.
rdesc = makeResampleDesc(method = "CV", stratify = FALSE, iters = 3)
r = resample(learner = multilabel.lrn, task = yeast.task, resampling = rdesc, show.info = FALSE)
r
#> Resample Result
#> Task: multi
#> Learner: multilabel.classif.rpart
#> hamloss.aggr: 0.23
#> hamloss.mean: 0.23
#> hamloss.sd: 0.01
#> Runtime: 12.0061
r = resample(learner = multilabel.lrn2, task = yeast.task, resampling = rdesc, show.info = FALSE)
r
#> Resample Result
#> Task: multi
#> Learner: multilabel.rFerns
#> hamloss.aggr: 0.47
#> hamloss.mean: 0.47
#> hamloss.sd: 0.01
#> Runtime: 0.787593
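What the 3-fold cross-validation above does can be sketched by hand in base R. This is only an illustration under stated assumptions: the data are invented, one logistic regression per label (via glm) stands in for a binary relevance learner, and the fold assignment is a plain random split rather than mlr's own resampling code.

```r
# Toy data: two numeric features and two logical labels (all values invented).
set.seed(42)
n = 150
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$labelA = toy$x1 + rnorm(n, sd = 0.5) > 0
toy$labelB = toy$x2 + rnorm(n, sd = 0.5) > 0
labels = c("labelA", "labelB")
features = setdiff(colnames(toy), labels)

# Assign every observation to one of 3 folds at random (no stratification).
iters = 3
fold = sample(rep(seq_len(iters), length.out = n))

# For each fold: train on the other folds, predict the held-out fold,
# and compute the Hamming loss on the held-out observations.
fold.hamloss = sapply(seq_len(iters), function(i) {
  train.set = toy[fold != i, ]
  test.set = toy[fold == i, ]
  response = sapply(labels, function(lab) {
    fit = glm(reformulate(features, response = lab), data = train.set,
      family = binomial())
    predict(fit, newdata = test.set, type = "response") > 0.5
  })
  mean(response != as.matrix(test.set[labels]))
})

# Aggregate over the folds, analogous to hamloss.mean and hamloss.sd above.
c(mean = mean(fold.hamloss), sd = sd(fold.hamloss))
```

The important point is that each observation is predicted exactly once, by a model that never saw it during training.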
Binary performance
If you want to calculate a binary performance measure such as the accuracy, the mmce or the auc for each label, you can use the function getMultilabelBinaryPerformances. You can apply this function to any multilabel prediction, including the prediction stored in a resample result. Calculating the auc requires predicted probabilities.
getMultilabelBinaryPerformances(pred, measures = list(acc, mmce, auc))
#>         acc.test.mean mmce.test.mean auc.test.mean
#> label1           0.75           0.25     0.6321925
#> label2           0.64           0.36     0.6547917
#> label3           0.68           0.32     0.7118227
#> label4           0.69           0.31     0.6764835
#> label5           0.73           0.27     0.6676923
#> label6           0.70           0.30     0.6417739
#> label7           0.81           0.19     0.5968750
#> label8           0.73           0.27     0.5164474
#> label9           0.89           0.11     0.4688458
#> label10          0.86           0.14     0.3996463
#> label11          0.85           0.15     0.5000000
#> label12          0.76           0.24     0.5330667
#> label13          0.75           0.25     0.5938610
#> label14          1.00           0.00            NA
getMultilabelBinaryPerformances(r$pred, measures = list(acc, mmce))
#>         acc.test.mean mmce.test.mean
#> label1     0.69424907      0.3057509
#> label2     0.59288374      0.4071163
#> label3     0.70459247      0.2954075
#> label4     0.71286719      0.2871328
#> label5     0.71328093      0.2867191
#> label6     0.58709144      0.4129086
#> label7     0.54778651      0.4522135
#> label8     0.53496070      0.4650393
#> label9     0.31443939      0.6855606
#> label10    0.44807613      0.5519239
#> label11    0.45717832      0.5428217
#> label12    0.52461729      0.4753827
#> label13    0.53289201      0.4671080
#> label14    0.01406703      0.9859330
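The per-label accuracy and mmce can also be reproduced by hand, which shows how they relate to the Hamming loss. The following sketch assumes two small invented logical matrices in place of the truth and response of a real prediction.

```r
# Invented truth and response for 4 observations and 2 labels.
truth = cbind(
  label1 = c(TRUE, FALSE, TRUE, FALSE),
  label2 = c(FALSE, FALSE, TRUE, TRUE)
)
response = cbind(
  label1 = c(TRUE, TRUE, TRUE, FALSE),
  label2 = c(FALSE, FALSE, FALSE, FALSE)
)

# Per-label accuracy: share of correct predictions in each column.
acc.by.label = colMeans(truth == response)
# Per-label mean misclassification error is its complement.
mmce.by.label = 1 - acc.by.label

per.label = cbind(acc = acc.by.label, mmce = mmce.by.label)
per.label
#>         acc mmce
#> label1 0.75 0.25
#> label2 0.50 0.50

# The mean of the per-label mmce values equals the Hamming loss.
mean(mmce.by.label) == mean(truth != response)
#> [1] TRUE
```

The same relation holds in the first output above: the 14 mmce values for pred average to 0.2257143, which is exactly the hamloss reported by performance(pred).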