Parallelization
By default, R does not make use of parallelization. With the integration of parallelMap into mlr, it becomes easy to activate the parallel computing capabilities already supported by mlr. parallelMap supports all major parallelization backends: local multicore execution using parallel, socket and MPI clusters using snow, makeshift SSH-clusters using BatchJobs, and high-performance computing clusters (managed by a scheduler like SLURM, Torque/PBS, SGE or LSF), also using BatchJobs.
All you have to do is select a backend by calling one of the parallelStart* functions. The first loop mlr encounters that is marked as parallel executable will be automatically parallelized. It is good practice to call parallelStop at the end of your script.
library("parallelMap")
parallelStartSocket(2)
#> Starting parallelization in mode=socket with cpus=2.
rdesc = makeResampleDesc("CV", iters = 3)
r = resample("classif.lda", iris.task, rdesc)
#> Exporting objects to slaves for mode socket: .mlr.slave.options
#> Mapping in parallel: mode = socket; cpus = 2; elements = 3.
#> [Resample] Aggr. Result: mmce.test.mean=0.02
parallelStop()
#> Stopped parallelization. All cleaned up.
On Linux or Mac OS X, you may want to use parallelStartMulticore instead.
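As a minimal sketch, the socket example above could be run with forked processes instead; the resampling call itself is unchanged, only the backend differs (multicore mode is not available on Windows):
library("parallelMap")
library("mlr")
# fork-based parallelization (Linux / Mac OS X only)
parallelStartMulticore(2)
r = resample("classif.lda", iris.task, makeResampleDesc("CV", iters = 3))
parallelStop()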
Parallelization levels
We offer different parallelization levels for fine-grained control over the parallelization. E.g., if you do not want to parallelize the benchmark function because it has only very few iterations, but want to parallelize the resampling of each learner instead, you can specifically pass the level "mlr.resample" to the parallelStart* function, as shown below.
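For example, a socket backend restricted to the resampling level could look like the following minimal sketch (the two learners and the 10-fold CV are just placeholder choices; only the inner resampling loops are distributed to the workers, while the benchmark loop over learners runs sequentially):
learners = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
parallelStartSocket(2, level = "mlr.resample")
bmr = benchmark(learners, iris.task, makeResampleDesc("CV", iters = 10))
parallelStop()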
Currently the following levels are supported:
parallelGetRegisteredLevels()
#> mlr: mlr.benchmark, mlr.resample, mlr.selectFeatures, mlr.tuneParams
Here is a brief explanation of what these levels do:
"mlr.resample": Each resampling iteration (a train / test step) is a parallel job.
"mlr.benchmark": Each experiment "run this learner on this data set" is a parallel job.
"mlr.tuneParams": Each evaluation in hyperparameter space "resample with these parameter settings" is a parallel job (see the sketch after this list). How many of these can be run independently in parallel depends on the tuning algorithm. For grid search or random search this is no problem, but for other tuners it depends on how many points are produced in each iteration of the optimization. If a tuner works in a purely sequential fashion, we cannot work magic and the hyperparameter evaluation will also run sequentially. But note that you can still parallelize the underlying resampling.
"mlr.selectFeatures": Each evaluation in feature space "resample with this feature subset" is a parallel job. The same comments as for "mlr.tuneParams" apply here.
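As an illustration of the "mlr.tuneParams" level, here is a minimal sketch (the random search over rpart's cp parameter and the budget of 20 iterations are just example choices); each sampled parameter setting is resampled in its own parallel job:
ps = makeParamSet(makeNumericParam("cp", lower = 0.001, upper = 0.1))
ctrl = makeTuneControlRandom(maxit = 20L)
parallelStartSocket(2, level = "mlr.tuneParams")
res = tuneParams("classif.rpart", iris.task, makeResampleDesc("CV", iters = 3),
  par.set = ps, control = ctrl)
parallelStop()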
Custom learners and parallelization
If you have implemented a custom learner yourself, locally, you currently need to export it to the slaves. So if you see an error like the following after calling, e.g., a parallelized version of resample:
no applicable method for 'trainLearner' applied to an object of class <my_new_learner>
simply add the following line somewhere after calling parallelStart:
parallelExport("trainLearner.<my_new_learner>", "predictLearner.<my_new_learner>")
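Putting this together, a minimal sketch with a hypothetical custom learner named classif.myLearner (the name is only an illustration) would place the export between parallelStart* and the parallelized call:
parallelStartSocket(2)
# export the S3 methods of the hypothetical custom learner to the slaves
parallelExport("trainLearner.classif.myLearner", "predictLearner.classif.myLearner")
r = resample("classif.myLearner", iris.task, makeResampleDesc("CV", iters = 3))
parallelStop()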
The end
For further details, consult the parallelMap tutorial and help.