Predicting Outcomes for New Data
Predicting the target values for new observations is implemented the same way as most of the other predict methods in R. In general, all you need to do is call predict on the object returned by train and pass the data you want predictions for.
There are two ways to pass the data:
- Either pass the Task via the task argument or
- pass a data frame via the newdata argument.
The first way is preferable if you want predictions for data already included in a Task.
Just like train, the predict function has a subset argument, so you can set aside different portions of the data in the Task for training and prediction (more advanced methods for splitting the data into training and test sets are described in the section on resampling).
In the following example we fit a gradient boosting machine to every second observation of the BostonHousing data set and make predictions on the remaining data in bh.task.
n = getTaskSize(bh.task)
train.set = seq(1, n, by = 2)
test.set = seq(2, n, by = 2)
lrn = makeLearner("regr.gbm", n.trees = 100)
mod = train(lrn, bh.task, subset = train.set)
task.pred = predict(mod, task = bh.task, subset = test.set)
task.pred
#> Prediction: 253 observations
#> predict.type: response
#> threshold:
#> time: 0.00
#> id truth response
#> 2 2 21.6 22.28539
#> 4 4 33.4 23.33968
#> 6 6 28.7 22.40896
#> 8 8 27.1 22.12750
#> 10 10 18.9 22.12750
#> 12 12 18.9 22.12750
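To get a rough idea of the quality of these predictions, the Prediction object can be passed to performance, which for regression tasks computes the mean squared error by default (see the section on performance for details). A minimal sketch:
## Sketch: evaluate the predictions with the task's default measure (mse).
performance(task.pred)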
The second way is useful if you want to predict data not included in the Task.
Here we cluster the iris data set without the target variable.
All observations with an odd index are included in the Task and used for training.
Predictions are made for the remaining observations.
n = nrow(iris)
iris.train = iris[seq(1, n, by = 2), -5]
iris.test = iris[seq(2, n, by = 2), -5]
task = makeClusterTask(data = iris.train)
mod = train("cluster.kmeans", task)
newdata.pred = predict(mod, newdata = iris.test)
newdata.pred
#> Prediction: 75 observations
#> predict.type: response
#> threshold:
#> time: 0.00
#> response
#> 2 2
#> 4 2
#> 6 2
#> 8 2
#> 10 2
#> 12 2
Note that for supervised learning you do not have to remove the target columns from the data. These columns are automatically removed prior to calling the underlying predict method of the learner.
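For example, in the sketch below the complete iris data frame, including the target column Species, is passed via newdata; mlr strips the target before the learner's predict method is called.
## Sketch: newdata may still contain the target column Species.
mod = train("classif.lda", iris.task)
pred = predict(mod, newdata = iris[1:5, ])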
Accessing the prediction
The returned Prediction object is a named list. Its most important element is data, a data frame that contains columns with the true values of the target variable (in the case of supervised learning problems) and the predictions.
In the following, the predictions on the BostonHousing and iris data sets are shown. As you may recall, the predictions in the first case were made from a Task and in the second case from a data frame.
## Result of predict with data passed via task argument
head(task.pred$data)
#> id truth response
#> 2 2 21.6 22.28539
#> 4 4 33.4 23.33968
#> 6 6 28.7 22.40896
#> 8 8 27.1 22.12750
#> 10 10 18.9 22.12750
#> 12 12 18.9 22.12750
## Result of predict with data passed via newdata argument
head(newdata.pred$data)
#> response
#> 2 2
#> 4 2
#> 6 2
#> 8 2
#> 10 2
#> 12 2
As you can see, when predicting from a Task the resulting data frame contains an additional column called id, which tells us to which element in the original data set each prediction corresponds.
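The id column makes it easy to relate predictions back to the original observations. A minimal sketch, assuming the BostonHousing data from the mlbench package:
## Look up the original rows that the first predictions belong to.
library(mlbench)
data(BostonHousing)
head(BostonHousing[task.pred$data$id, ])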
Extract Probabilities
The predicted probabilities can be extracted from the Prediction using the function getProbabilities. Here is another cluster analysis example. We use fuzzy c-means clustering on the mtcars data set.
lrn = makeLearner("cluster.cmeans", predict.type = "prob")
mod = train(lrn, mtcars.task)
pred = predict(mod, task = mtcars.task)
head(getProbabilities(pred))
#> 1 2
#> Mazda RX4 0.97959529 0.020404714
#> Mazda RX4 Wag 0.97963550 0.020364495
#> Datsun 710 0.99265984 0.007340164
#> Hornet 4 Drive 0.54292079 0.457079211
#> Hornet Sportabout 0.01870622 0.981293776
#> Valiant 0.75746556 0.242534444
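Since these are fuzzy cluster memberships, the probabilities in each row form a distribution over the clusters. A quick sanity check (a sketch):
## The membership probabilities of each observation sum to one.
summary(rowSums(getProbabilities(pred)))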
For classification problems there are some more things worth mentioning. By default, class labels are predicted.
## Linear discriminant analysis on the iris data set
mod = train("classif.lda", task = iris.task)
pred = predict(mod, task = iris.task)
pred
#> Prediction: 150 observations
#> predict.type: response
#> threshold:
#> time: 0.00
#> id truth response
#> 1 1 setosa setosa
#> 2 2 setosa setosa
#> 3 3 setosa setosa
#> 4 4 setosa setosa
#> 5 5 setosa setosa
#> 6 6 setosa setosa
A confusion matrix can be obtained by calling getConfMatrix.
getConfMatrix(pred)
#> predicted
#> true setosa versicolor virginica -SUM-
#> setosa 50 0 0 0
#> versicolor 0 48 2 2
#> virginica 0 1 49 1
#> -SUM- 0 1 2 3
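If you prefer relative frequencies over absolute counts, getConfMatrix accepts a relative argument (assuming your mlr version provides it):
## Relative frequencies instead of absolute counts.
getConfMatrix(pred, relative = TRUE)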
In order to get predicted posterior probabilities we have to create a Learner with the appropriate predict.type.
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, iris.task)
pred = predict(mod, newdata = iris)
head(pred$data)
#> truth prob.setosa prob.versicolor prob.virginica response
#> 1 setosa 1 0 0 setosa
#> 2 setosa 1 0 0 setosa
#> 3 setosa 1 0 0 setosa
#> 4 setosa 1 0 0 setosa
#> 5 setosa 1 0 0 setosa
#> 6 setosa 1 0 0 setosa
In addition to the probabilities, class labels are predicted by choosing the class with the maximum probability and breaking ties at random.
As mentioned above, the predicted posterior probabilities can be accessed via the getProbabilities function.
head(getProbabilities(pred))
#> setosa versicolor virginica
#> 1 1 0 0
#> 2 1 0 0
#> 3 1 0 0
#> 4 1 0 0
#> 5 1 0 0
#> 6 1 0 0
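As a small sanity check (a sketch, glossing over the rare case of exact ties), the predicted response should coincide with the class of maximum probability:
## The response is the column-wise argmax of the probability matrix
## (ties, if any, are broken at random).
probs = getProbabilities(pred)
all(colnames(probs)[max.col(probs)] == pred$data$response)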
Adjusting the threshold
We can set the threshold value that is used to map the predicted posterior probabilities to class labels. Note that for this purpose we need to create a Learner that predicts probabilities. For binary classification, the threshold determines when the positive class is predicted. The default is 0.5. Now, we set the threshold for the positive class to 0.9 (that is, an example is assigned to the positive class if its posterior probability exceeds 0.9). Which of the two classes is the positive one can be seen by accessing the Task. To illustrate binary classification, we use the Sonar data set from the mlbench package.
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, task = sonar.task)
## Label of the positive class
getTaskDescription(sonar.task)$positive
#> [1] "M"
## Default threshold
pred1 = predict(mod, sonar.task)
pred1$threshold
#> M R
#> 0.5 0.5
## Set the threshold value for the positive class
pred2 = setThreshold(pred1, 0.9)
pred2$threshold
#> M R
#> 0.9 0.1
pred2
#> Prediction: 208 observations
#> predict.type: prob
#> threshold: M=0.90,R=0.10
#> time: 0.01
#> id truth prob.M prob.R response
#> 1 1 R 0.1060606 0.8939394 R
#> 2 2 R 0.7333333 0.2666667 R
#> 3 3 R 0.0000000 1.0000000 R
#> 4 4 R 0.1060606 0.8939394 R
#> 5 5 R 0.9250000 0.0750000 M
#> 6 6 R 0.0000000 1.0000000 R
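The mapping from probabilities to labels can be verified directly: with the new threshold, an observation is assigned to the positive class "M" exactly when prob.M exceeds 0.9 (a sketch, ignoring ties at exactly 0.9):
## The response of pred2 is "M" iff the posterior probability of "M"
## exceeds the threshold of 0.9.
all((pred2$data$prob.M > 0.9) == (pred2$data$response == "M"))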
## We can also see the effect in the confusion matrix
getConfMatrix(pred1)
#> predicted
#> true M R -SUM-
#> M 95 16 16
#> R 10 87 10
#> -SUM- 10 16 26
getConfMatrix(pred2)
#> predicted
#> true M R -SUM-
#> M 84 27 27
#> R 6 91 6
#> -SUM- 6 27 33
Note that in the binary case getProbabilities by default extracts the posterior probabilities of the positive class only.
head(getProbabilities(pred1))
#> [1] 0.1060606 0.7333333 0.0000000 0.1060606 0.9250000 0.0000000
## But we can change that, too
head(getProbabilities(pred1, cl = c("M", "R")))
#> M R
#> 1 0.1060606 0.8939394
#> 2 0.7333333 0.2666667
#> 3 0.0000000 1.0000000
#> 4 0.1060606 0.8939394
#> 5 0.9250000 0.0750000
#> 6 0.0000000 1.0000000
It works similarly for multiclass classification. The threshold has to be given by a named vector specifying the values by which each probability will be divided. The class with the maximum resulting value is then selected.
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, iris.task)
pred = predict(mod, newdata = iris)
pred$threshold
#> setosa versicolor virginica
#> 0.3333333 0.3333333 0.3333333
table(as.data.frame(pred)$response)
#>
#> setosa versicolor virginica
#> 50 54 46
pred = setThreshold(pred, c(setosa = 0.01, versicolor = 50, virginica = 1))
pred$threshold
#> setosa versicolor virginica
#> 0.01 50.00 1.00
table(as.data.frame(pred)$response)
#>
#> setosa versicolor virginica
#> 50 0 100
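To see the mechanism at work, consider a single (hypothetical, purely illustrative) probability vector:
## Each probability is divided by its threshold; the largest quotient wins.
p = c(setosa = 0, versicolor = 0.6, virginica = 0.4)
th = c(setosa = 0.01, versicolor = 50, virginica = 1)
p / th                    # 0.000 0.012 0.400
names(which.max(p / th))  # "virginica"
The very large threshold for versicolor makes that class essentially impossible to predict, which is why its count in the table above drops to zero.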
If you are interested in tuning the threshold (vector) have a look at the section about performance curves and threshold tuning.
Visualizing the prediction
The function plotLearnerPrediction allows you to visualize predictions, e.g., for teaching purposes or for exploring models. It trains the chosen learning method for 1 or 2 selected features and then displays the predictions with ggplot.
For classification, we get a scatter plot of 2 features (by default the first 2 in the data set). The type of symbol shows the true class labels of the data points. Symbols with a white border indicate misclassified observations. The posterior probabilities (if the learner under consideration supports this) are represented by the background color, where higher saturation means larger probabilities.
The plot title displays the ID of the Learner (in the following example CART), its parameters, its training performance and its cross-validation performance. mmce stands for mean misclassification error, i.e., the error rate. See the sections on performance and resampling for further explanations.
lrn = makeLearner("classif.rpart", id = "CART")
plotLearnerPrediction(lrn, task = iris.task)
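plotLearnerPrediction returns the ggplot object, so the usual ggplot2 tools can be used to modify or save the plot (a sketch; the file name is made up):
## Save the plot with ggplot2's ggsave.
library(ggplot2)
g = plotLearnerPrediction(lrn, task = iris.task)
ggsave("cart_iris.png", plot = g)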
For clustering we also get a scatter plot of two selected features. The color of the points indicates the predicted cluster.
lrn = makeLearner("cluster.SimpleKMeans")
plotLearnerPrediction(lrn, task = mtcars.task, features = c("disp", "drat"), cv = 0)
For regression, there are two types of plots. The 1D plot shows the target values in relation to a single feature, the regression curve and, if the chosen learner supports this, the estimated standard error.
plotLearnerPrediction("regr.lm", features = "lstat", task = bh.task)
The 2D variant, as in the classification case, generates a scatter plot of 2 features. The fill color of the dots illustrates the value of the target variable "medv", while the background colors show the estimated mean. The plot does not represent the estimated standard error.
plotLearnerPrediction("regr.lm", features = c("lstat", "rm"), task = bh.task)