Use any of these functions to quickly clean columns in your data set. Use clean()
to pick the functions that return the least relative number of NA
s. They always return the class from the function name (e.g. clean_Date()
always returns class Date
).
clean(x)
# S3 method for data.frame
clean(x)
clean_logical(
x,
true = regex_true(),
false = regex_false(),
na = NULL,
fixed = FALSE,
ignore.case = TRUE
)
clean_factor(
x,
levels = unique(x),
ordered = FALSE,
droplevels = FALSE,
fixed = FALSE,
ignore.case = TRUE
)
clean_numeric(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_double(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_integer(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_character(
x,
remove = "[^a-z \t\r\n]",
fixed = FALSE,
ignore.case = TRUE,
trim = TRUE
)
clean_currency(x, currency_symbol = NULL, remove = "[^0-9.,-]", fixed = FALSE)
clean_percentage(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_Date(x, format = NULL, guess_each = FALSE, max_date = Sys.Date(), ...)
clean_POSIXct(
x,
tz = "",
remove = "[^.0-9 :/-]",
fixed = FALSE,
max_date = Sys.Date(),
...
)
Triennial Central Bank Survey Foreign exchange turnover in April 2016 (PDF). Bank for International Settlements. 11 December 2016. p. 10.
data to clean
regex to interpret values as TRUE
(which defaults to regex_true
), see Details
regex to interpret values as FALSE
(which defaults to regex_false
), see Details
regex to force interpret values as NA
, i.e. not as TRUE
or FALSE
logical to indicate whether regular expressions should be turned off
logical to indicate whether matching should be case-insensitive
new factor levels, may be named with regular expressions to match existing values, see Details
logical to indicate whether the factor levels should be ordered
logical to indicate whether non-existing factor levels should be dropped
regex to define the character(s) that should be removed, see Details
logical to indicate whether the result should be trimmed with trimws(..., which = "both")
the currency symbol to use, which will be guessed based on the input and otherwise defaults to the current system locale setting (see Sys.localeconv
)
character string giving a date-time format as used by strptime.
For clean_Date(..., guess_each = TRUE)
, this can be a vector of values to be used for guessing, see Examples.
logical to indicate whether all items of x
should be guessed one by one, see Examples
date (coercible with [as.Date()]) to indicate to maximum allowed of x
, which defaults to today. This is to prevent that clean_Date("23-03-47")
will return 23 March 2047 and instead returns 23 March 1947 with a warning.
for clean_Date
and clean_POSIXct
: other parameters passed on these functions
time zone specification to be used for the conversion,
if one is required. System-specific (see time zones),
but ""
is the current time zone, and "GMT"
is UTC
(Universal Time, Coordinated). Invalid values are most commonly
treated as UTC, on some platforms with a warning.
The clean_*
functions always return the class from the function name:
clean_logical()
: class logical
clean_factor()
: class factor
clean_numeric()
and clean_double()
: class numeric
clean_integer()
: class integer
clean_character()
: class character
clean_percentage()
: class percentage
clean_currency()
: class currency
clean_Date()
: class Date
clean_POSIXct()
: classes POSIXct/POSIXt
Using clean()
on a vector will guess a cleaning function based on the potential number of NAs
it returns. Using clean()
on a data.frame to apply this guessed cleaning over all columns.
Info about the different functions:
clean_logical()
:
Use parameters true
and false
to match values using case-insensitive regular expressions (regex). Unmatched values are considered NA
. At default, values are matched with regex_true
and regex_false
. This allows support for values "Yes" and "No" in the following languages: Arabic, Bengali, Chinese (Mandarin), Dutch, English, French, German, Hindi, Indonesian, Japanese, Malay, Portuguese, Russian, Spanish, Telugu, Turkish and Urdu. Use parameter na
to override values as NA
that would else be matched with true
or false
. See Examples.
clean_factor()
:
Use parameter levels
to set new factor levels. They can be case-insensitive regular expressions to match existing values of x
. For matching, new values for levels
are internally temporary sorted descending on text length. See Examples.
clean_numeric()
, clean_double()
, clean_integer()
and clean_character()
:
Use parameter remove
to match values that must be removed from the input, using regular expressions (regex). In case of clean_numeric()
, comma's will be read as dots and only the last dot will be kept. Function clean_character()
will keep middle spaces at default. See Examples.
clean_percentage()
:
This new class works like clean_numeric()
, but transforms it with as.percentage
, which will retain the original values, but will print them as percentages. See Examples.
clean_currency()
:
This new class works like clean_numeric()
, but transforms it with as.currency
. The currency symbol is guessed based on the most traded currencies by value (see Source): the United States dollar, Euro, Japanese yen, Pound sterling, Swiss franc, Renminbi, Swedish krona, Mexican peso, South Korean won, Turkish lira, Russian ruble, Indian rupee and the South African rand. See Examples.
clean_Date()
:
Use parameter format
to define a date format, or leave it empty to have the format guessed. Use "Excel"
to read values as Microsoft Excel dates. The format
parameter will be evaluated with format_datetime
, which means that a format like "d-mmm-yy"
with be translated internally to "%e-%b-%y"
for convenience. See Examples.
clean_POSIXct()
:
Use parameter remove
to match values that must be removed from the input, using regular expressions (regex). The resulting string will be coerced to a date/time element with class POSIXct
, using as.POSIXct()
. See Examples.
The use of invalid regular expressions in any of the above functions will not return an error (like in base R), but will instead interpret the expression as a fixed value and will throw a warning.
clean_logical(c("Yes", "No")) # English
#> [1] TRUE FALSE
clean_logical(c("Oui", "Non")) # French
#> [1] TRUE FALSE
clean_logical(c("ya", "tidak")) # Indonesian
#> [1] TRUE FALSE
clean_logical(x = c("Positive", "Negative", "Unknown", "Some value"),
true = "pos", false = "neg")
#> [1] TRUE FALSE NA NA
gender_age <- c("male 0-50", "male 50+", "female 0-50", "female 50+")
clean_factor(gender_age, c("M", "F"))
#> [1] M M F F
#> Levels: M F
clean_factor(gender_age, c("Male", "Female"))
#> [1] Male Male Female Female
#> Levels: Male Female
clean_factor(gender_age, c("0-50", "50+"), ordered = TRUE)
#> [1] 0-50 50+ 0-50 50+
#> Levels: 0-50 < 50+
clean_Date("13jul18", "ddmmmyy")
#> [1] "2018-07-13"
clean_Date("12 August 2010")
#> (assuming format 'dd mmmm yyyy')
#> [1] "2010-08-12"
clean_Date("12 06 2012")
#> (assuming format 'dd mm yyyy')
#> [1] "2012-06-12"
clean_Date("October 1st 2012")
#> (assuming format 'mmmm d yyyy')
#> [1] "2012-10-01"
clean_Date("43658")
#> (assuming Excel format)
#> [1] "2019-07-12"
clean_Date("14526", "Excel")
#> [1] "1939-10-08"
clean_Date(c("1 Oct 13", "October 1st 2012")) # could not be fitted in 1 format
#> Warning: Try guess_each = TRUE to guess the format for each value.
#> Date/time format could not be determined automatically, returning NAs
#> [1] NA NA
clean_Date(c("1 Oct 13", "October 1st 2012"), guess_each = TRUE)
#> [1] "2013-10-01" "2012-10-01"
clean_Date(c("12-14-13", "1 Oct 2012"),
guess_each = TRUE,
format = c("d mmm yyyy", "mm-yy-dd")) # only these formats will be tried
#> (format 'd mmm yyyy' used for 1 items)
#> (format 'mm-yy-dd' used for 1 items)
#> [1] "2014-12-13" "2012-10-01"
clean_POSIXct("Created log on 2020/02/11 11:23 by user Joe")
#> [1] "2020-02-11 11:23:00 UTC"
clean_POSIXct("Created log on 2020.02.11 11:23 by user Joe", tz = "UTC")
#> [1] "2020-02-11 11:23:00 UTC"
clean_numeric("qwerty123456")
#> [1] 123456
clean_numeric("Positive (0.143)")
#> [1] 0.143
clean_numeric("0,143")
#> [1] 0.143
clean_numeric("minus 12 degrees")
#> [1] -12
clean_percentage("PCT: 0.143")
#> [1] 0.143%
clean_percentage(c("Total of -12.3%", "Total of +4.5%"))
#> [1] -12.3% 4.5%
clean_character("qwerty123456")
#> [1] "qwerty"
clean_character("Positive (0.143)")
#> [1] "Positive"
clean_currency(c("Received 25", "Received 31.40"))
#> [1] `25.00` `31.40`
clean_currency(c("Jack sent £ 25", "Bill sent £ 31.40"))
#> [1] `GBP 25.00` `GBP 31.40`
df <- data.frame(A = c("2 Apr 2016", "5 Feb 2020"),
B = c("yes", "no"),
C = c("Total of -12.3%", "Total of +4.5%"),
D = c("Marker: 0.4513 mmol/l", "Marker: 0.2732 mmol/l"))
df
#> A B C D
#> 1 2 Apr 2016 yes Total of -12.3% Marker: 0.4513 mmol/l
#> 2 5 Feb 2020 no Total of +4.5% Marker: 0.2732 mmol/l
clean(df)
#> Note: Assuming class 'Date' for variable 'A'
#> (assuming format 'd mmm yyyy')
#> Note: Assuming class 'logical' for variable 'B'
#> Note: Assuming class 'percentage' for variable 'C'
#> Note: Assuming class 'numeric' for variable 'D'
#> A B C D
#> 1 2016-04-02 TRUE -12.3% 0.4513
#> 2 2020-02-05 FALSE 4.5% 0.2732