Creating an Imputation Method

The function makeImputeMethod lets you create your own imputation method. To do so, you specify a learn function that extracts the necessary information and an impute function that does the actual imputation. Both functions take at least the arguments data (the data.frame), target (the name of the target variable, if present) and col (the name of the feature to be imputed).

Let's have a look at function imputeMean.

imputeMean
#> function () 
#> {
#>     makeImputeMethod(learn = function(data, target, col) mean(data[[col]], 
#>         na.rm = TRUE), impute = simpleImpute)
#> }
#> <bytecode: 0x7fa7dbbea4e8>
#> <environment: namespace:mlr>
mlr:::simpleImpute
#> function (data, target, col, const) 
#> {
#>     if (is.na(const)) 
#>         stopf("Error imputing column '%s'. Maybe all input data was missing?", 
#>             col)
#>     x = data[[col]]
#>     if (is.factor(x) && const %nin% levels(x)) {
#>         levels(x) = c(levels(x), as.character(const))
#>     }
#>     replace(x, is.na(x), const)
#> }
#> <bytecode: 0x7fa7dc5fdff8>
#> <environment: namespace:mlr>

The learn function calculates the mean of the non-missing observations in column col. This mean is passed via argument const to the impute function, simpleImpute, which replaces all NAs in feature col with it.
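To make the hand-off between the two functions concrete, here is a minimal base R sketch that calls a learn and an impute function by hand on a toy data frame. The names learnMean, imputeConst and toy are made up for illustration. They only mirror the printed source in simplified form (no error check, no factor handling) and are not part of mlr.

```r
# Hypothetical stand-ins for the learn and impute functions shown above.
learnMean = function(data, target, col) {
  mean(data[[col]], na.rm = TRUE)
}
imputeConst = function(data, target, col, const) {
  x = data[[col]]
  replace(x, is.na(x), const)
}

# Toy data, made up for illustration.
toy = data.frame(a = c(1, NA, 3), b = c("u", "v", "w"))

# Step 1: learn extracts the information (here: the mean of 1 and 3).
const = learnMean(toy, target = character(0), col = "a")

# Step 2: impute receives that information as an extra argument.
toy$a = imputeConst(toy, target = character(0), col = "a", const = const)
toy$a
#> [1] 1 2 3
```

Whatever learn returns is handed to impute as additional arguments. That is why impute may have more parameters than the three required ones.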

Now let's write a new imputation method. For longitudinal data, a frequently used technique is last observation carried forward (LOCF), where missing values are replaced by the most recently observed value.

In the R code below, the learn function determines the last observed value before each run of consecutive NAs (values) as well as the length of that run (times). The impute function generates a vector in which the entries of values are replicated according to times and uses it to replace the NAs in feature col.

imputeLOCF = function() {
  makeImputeMethod(
    learn = function(data, target, col) {
      # assumes that the first and the last entry of the column are observed,
      # so that first and last have the same length
      x = data[[col]]
      ind = is.na(x)
      dind = diff(ind)
      first = which(dind == 1)     # position of the last observed value before each run of NAs
      last = which(dind == -1)     # position of the last NA in each run
      values = x[first]            # observed value preceding each run
      times = last - first         # number of consecutive NAs in each run
      return(list(values = values, times = times))
    },
    impute = function(data, target, col, values, times) {
      x = data[[col]]
      replace(x, is.na(x), rep(values, times))
    }
  )
}
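
To see what the learn and impute steps compute, the same operations can be traced on a short vector in base R. The vector x below is made up for illustration; the intermediate results are given in the comments.

```r
x = c(5, NA, NA, 7, NA, 9)     # toy column, starts and ends with an observed value
ind = is.na(x)
dind = diff(ind)               # 1 where a run of NAs starts, -1 where it ends
first = which(dind == 1)       # 1 4
last = which(dind == -1)       # 3 5
values = x[first]              # 5 7
times = last - first           # 2 1
filled = replace(x, ind, rep(values, times))
filled
#> [1] 5 5 5 7 7 9
```

Each run of NAs is filled with the value just before it: positions 2 and 3 receive 5, position 5 receives 7.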

Below, the missing values in features Ozone and Solar.R of the airquality data set are imputed by LOCF.

data(airquality)
imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
  dummy.cols = c("Ozone", "Solar.R"))
head(imp$data, 10)
#>    Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
#> 1     41     190  7.4   67     5   1       FALSE         FALSE
#> 2     36     118  8.0   72     5   2       FALSE         FALSE
#> 3     12     149 12.6   74     5   3       FALSE         FALSE
#> 4     18     313 11.5   62     5   4       FALSE         FALSE
#> 5     18     313 14.3   56     5   5        TRUE          TRUE
#> 6     28     313 14.9   66     5   6       FALSE          TRUE
#> 7     23     299  8.6   65     5   7       FALSE         FALSE
#> 8     19      99 13.8   59     5   8       FALSE         FALSE
#> 9      8      19 20.1   61     5   9       FALSE         FALSE
#> 10     8     194  8.6   69     5  10        TRUE         FALSE
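
Because times is computed as last - first, the learn function in imputeLOCF relies on every run of NAs being enclosed by observed values. A column that starts or ends with NA would yield first and last vectors of different lengths. As a rough sketch of an alternative formulation, the base R helper below carries values forward by index instead. The name locf is hypothetical, and the function is neither part of mlr nor the method used above.

```r
# Hypothetical helper, not part of mlr: carry the last observed value forward.
locf = function(x) {
  # index of the most recent observed value at each position, 0 if there is none yet
  idx = cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] = NA           # leading NAs have nothing to carry forward
  x[idx]
}

locf(c(NA, 5, NA, NA, 7, NA))
#> [1] NA  5  5  5  7  7
```

Leading NAs stay missing, since there is no earlier observation to carry forward, while trailing NAs are filled. On a column that starts and ends with observed values, such as Ozone in airquality, this should give the same result as the run-based approach.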