
Recode and Replace Values in Matrix-Like Objects
recode-replace.Rd
A small suite of functions to efficiently perform common recoding and replacing tasks in matrix-like objects (vectors, matrices, arrays, data frames, lists of atomic objects):
- recode_num and recode_char can be used to efficiently recode multiple numeric or character values, respectively. The syntax is inspired by dplyr::recode, but the functionality is enhanced in the following respects: (1) they are faster than dplyr::recode, (2) when passed a data frame / list, all appropriately typed columns will be recoded, (3) they preserve the attributes of the data object and of columns in a data frame / list, and (4) recode_char also supports regular expression matching using grepl.
- replace_NA efficiently replaces NA/NaN with a value (default is 0L). Data can be multi-typed, in which case appropriate columns can be selected through the cols argument. For numeric data a more versatile alternative is provided by data.table::nafill and data.table::setnafill.
- replace_Inf replaces Inf/-Inf (or optionally NaN/Inf/-Inf) with a value (default is NA). replace_Inf skips non-numeric columns in a data frame.
- replace_outliers replaces values falling outside a 1- or 2-sided numeric threshold or outside a certain number of standard deviations with a value (default is NA). replace_outliers skips non-numeric columns in a data frame.
Usage
recode_num(X, ..., default = NULL, missing = NULL, set = FALSE)
recode_char(X, ..., default = NULL, missing = NULL, regex = FALSE,
            ignore.case = FALSE, fixed = FALSE, set = FALSE)
replace_NA(X, value = 0L, cols = NULL, set = FALSE)
replace_Inf(X, value = NA, replace.nan = FALSE)
replace_outliers(X, limits, value = NA,
                 single.limit = c("SDs", "min", "max", "overall_SDs"))
Arguments
- X
a vector, matrix, array, data frame or list of atomic objects.
- ...
comma-separated recode arguments of the form: value = replacement, `2` = 0, Secondary = "SEC" etc. recode_char with regex = TRUE also supports regular expressions, i.e. `^S|D$` = "STD" etc.
- default
optional argument to specify a scalar value to replace non-matched elements with.
- missing
optional argument to specify a scalar value to replace missing elements with. Note that to increase efficiency this is done before the rest of the recoding, i.e. the recoding is performed on data where missing values are filled!
- set
logical. TRUE does (some) replacements by reference (i.e. in-place modification of the data). For replace_NA this feature is mature, and the result will be returned invisibly. For recode_num and recode_char, replacement by reference is still partial, so you need to assign the result to an object to materialize all changes.
- regex
logical. If TRUE, all recode-argument names are (sequentially) passed to grepl as a pattern to search X. All matches are replaced. Note that NA's are also matched as strings by grepl.
- value
a single (scalar) value to replace matching elements with.
- cols
select the columns in which to replace missing values, using a function, column names, indices or a logical vector.
- replace.nan
logical. TRUE replaces NaN/Inf/-Inf. FALSE (default) replaces only Inf/-Inf.
- limits
either a vector of two numeric values c(minval, maxval) constituting a two-sided outlier threshold, or a single numeric value constituting either a factor of standard deviations (default), or the minimum or maximum of a one-sided outlier threshold. See also single.limit.
- single.limit
a character or integer (argument only applies if length(limits) == 1):
1 - "SDs" specifies that limits will be interpreted as a (two-sided) threshold in column standard-deviations on standardized data. The underlying code is equivalent to X[abs(fscale(X)) > limits] <- value but faster. Since fscale is S3 generic with methods for grouped_df, pseries and pdata.frame, the standardizing will be grouped if such objects are passed (i.e. the outlier threshold is then measured in within-group standard deviations).
2 - "min" specifies that limits will be interpreted as a (one-sided) minimum threshold. The underlying code is equivalent to X[X < limits] <- value.
3 - "max" specifies that limits will be interpreted as a (one-sided) maximum threshold. The underlying code is equivalent to X[X > limits] <- value.
4 - "overall_SDs" is equivalent to "SDs" but ignores groups when a grouped_df, pseries or pdata.frame is passed (i.e. standardizing and determination of outliers is by the overall column standard deviation).
- ignore.case, fixed
logical. Passed to grepl and only applicable if regex = TRUE.
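The threshold rules described for limits and single.limit can be sketched in base R as follows. This is only an illustration under stated assumptions: sketch_outliers is a hypothetical helper (not part of the package), it handles numeric matrices only, and base scale() stands in for fscale, so grouped standardization and the "overall_SDs" option are not covered.

```r
# Base-R sketch of the threshold rules for `limits` / `single.limit`.
# `sketch_outliers` is a hypothetical name; scale() is a stand-in for fscale().
sketch_outliers <- function(X, limits, value = NA,
                            single.limit = c("SDs", "min", "max")) {
  stopifnot(is.matrix(X), is.numeric(X))
  if (length(limits) == 2L) {
    # two-sided threshold c(minval, maxval)
    X[X < limits[1L] | X > limits[2L]] <- value
    return(X)
  }
  how <- match.arg(single.limit)
  if (how == "SDs") {
    # threshold in column standard deviations on standardized data
    X[abs(scale(X)) > limits] <- value
  } else if (how == "min") {
    X[X < limits] <- value
  } else {
    X[X > limits] <- value
  }
  X
}

m <- cbind(a = c(1, 2, 3, 100), b = c(10, 20, 30, 40))
sketch_outliers(m, c(2, 50))                  # two-sided: 1 and 100 in column a become NA
sketch_outliers(m, 50, single.limit = "max")  # one-sided: only 100 becomes NA
```

The single.limit argument is only consulted when limits has length one, mirroring the description above.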
Note
These functions are not generic and do not offer support for factors or date(-time) objects. See dplyr::recode_factor, forcats and other appropriate packages for dealing with these classes.
Simple replacing tasks on a vector can also effectively be handled by setv / copyv. Fast vectorized switches are offered by package kit (functions iif, nif, vswitch, nswitch) as well as data.table::fcase and data.table::fifelse.
Examples
recode_char(c("a","b","c"), a = "b", b = "c")
#> [1] "b" "c" "c"
recode_char(month.name, ber = NA, regex = TRUE)
#> [1] "January" "February" "March" "April" "May" "June"
#> [7] "July" "August" NA NA NA NA
mtcr <- recode_num(mtcars, `0` = 2, `4` = Inf, `1` = NaN)
replace_Inf(mtcr)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 2 NaN NA NA
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 2 NaN NA NA
#> Datsun 710 22.8 NA 108 93 3.85 2.320 18.61 NaN NaN NA NaN
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 NaN 2 3 NaN
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 2 2 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 NaN 2 3 NaN
#> [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
replace_Inf(mtcr, replace.nan = TRUE)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 2 NA NA NA
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 2 NA NA NA
#> Datsun 710 22.8 NA 108 93 3.85 2.320 18.61 NA NA NA NA
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 NA 2 3 NA
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 2 2 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 NA 2 3 NA
#> [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
replace_outliers(mtcars, c(2, 100)) # Replace all values below 2 and above 100 with NA
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 NA NA 3.90 2.620 16.46 NA NA 4 4
#> Mazda RX4 Wag 21.0 6 NA NA 3.90 2.875 17.02 NA NA 4 4
#> Datsun 710 22.8 4 NA 93 3.85 2.320 18.61 NA NA 4 NA
#> Hornet 4 Drive 21.4 6 NA NA 3.08 3.215 19.44 NA NA 3 NA
#> Hornet Sportabout 18.7 8 NA NA 3.15 3.440 17.02 NA NA 3 2
#> Valiant 18.1 6 NA NA 2.76 3.460 20.22 NA NA 3 NA
#> [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
replace_outliers(mtcars, 2, single.limit = "min") # Replace all values smaller than 2 with NA
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 NA NA 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 NA NA 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 NA NA 4 NA
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 NA NA 3 NA
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 NA NA 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 NA NA 3 NA
#> [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
replace_outliers(mtcars, 100, single.limit = "max") # Replace all values larger than 100 with NA
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 NA NA 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 NA NA 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 NA 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 NA NA 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 NA NA 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 NA NA 2.76 3.460 20.22 1 0 3 1
#> [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
# Replace all values above or below 2 column-standard-deviations from the column-mean with NA
replace_outliers(mtcars, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#> [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
# Passing a grouped_df, pseries or pdata.frame allows removing outliers according to
# the in-group standard-deviation. See ?fscale
replace_outliers(fgroup_by(iris, Species), 2)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> 11 5.4 3.7 1.5 0.2 setosa
#> 12 4.8 3.4 1.6 0.2 setosa
#> 13 4.8 3.0 1.4 0.1 setosa
#> 14 NA 3.0 NA 0.1 setosa
#> [ reached 'max' / getOption("max.print") -- omitted 136 rows ]
#>
#> Grouped by: Species [3 | 50 (0)]
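The examples above do not exercise replace_NA, so the following is a small sketch of how the cols argument might be used on multi-typed data. It assumes the collapse package (which provides these functions) is installed and attached; the data frame d and the result name d_num are made up for illustration.

```r
library(collapse)  # assumption: replace_NA is provided by the collapse package

# Illustrative multi-typed data: one numeric and one character column
d <- data.frame(x = c(1, NA, 3), g = c("a", NA, "c"))

# Select columns with a function: only numeric columns have NA replaced by 0,
# the character column is left untouched
d_num <- replace_NA(d, value = 0, cols = is.numeric)
d_num
```

Selecting columns by name (cols = "x") or by index (cols = 1) should be equivalent here, per the description of cols.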