Magic potions to clean and transform your data 🧙
BREAKING CHANGES
The following re-exported functions from {insight} have now been removed: object_has_names(), object_has_rownames(), is_empty_object(), compact_list(), compact_character().
Argument na.rm was renamed to remove_na throughout {datawizard} functions. na.rm is kept for backward compatibility, but will be deprecated and later removed in future updates.
The way expressions are defined in data_filter() was revised. The filter argument was replaced by ..., which allows multiple expressions to be separated by commas (they are then combined with &). Furthermore, expressions can now also be defined as strings, or be provided as character vectors, to allow string-friendly programming.
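As a minimal sketch of the revised data_filter() interface, the three calls below are expected to select the same rows of the built-in mtcars data. The object names are only illustrative.

```r
library(datawizard)

# Multiple expressions separated by commas are combined with &
by_comma <- data_filter(mtcars, vs == 0, am == 1)

# The same filter written as a single expression
by_and <- data_filter(mtcars, vs == 0 & am == 1)

# The same filter provided as a string
by_string <- data_filter(mtcars, "vs == 0 & am == 1")

nrow(by_comma)
nrow(by_and)
nrow(by_string)
```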
CHANGES
Weighted functions (weighted_sd(), weighted_mean(), ...) gain a remove_na argument, to remove or keep missing and infinite values. By default, remove_na = TRUE, i.e. missing and infinite values are removed.
reverse_scale(), normalize() and rescale() gain an append argument (similar to other data frame methods of transformation functions), to append recoded variables to the input data frame instead of overwriting existing variables.
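The two changes above can be sketched as follows. The vectors x and w are made-up illustration data, and the example only assumes that append = TRUE adds one new column for the selected variable.

```r
library(datawizard)

x <- c(1, 2, NA)
w <- c(1, 2, 1)

# Missing values are dropped by default (remove_na = TRUE)
weighted_mean(x, weights = w)

# Keep missing values; the result is then expected to be NA
weighted_mean(x, weights = w, remove_na = FALSE)

# append = TRUE adds the rescaled variable as a new column
# instead of overwriting Sepal.Length
rescaled <- rescale(iris, select = "Sepal.Length", to = c(0, 1), append = TRUE)
ncol(iris)
ncol(rescaled)
```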
NEW FUNCTIONS
rowid_as_column() to complement rownames_as_column() (and to mimic tibble::rowid_to_column()). Note that its behavior is different from tibble::rowid_to_column() for grouped data. See the Details section in the docs.
data_unite(), to merge values of multiple variables into one new variable.
data_separate(), as counterpart to data_unite(), to separate a single variable into multiple new variables.
data_modify(), to create new variables, or modify or remove existing variables in a data frame.
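A short sketch of how the new functions listed above might be combined. The data frame and its column names are invented for illustration, and the separator " " is an assumption chosen for this example.

```r
library(datawizard)

df <- data.frame(
  first = c("Ada", "Alan"),
  last = c("Lovelace", "Turing")
)

# Merge two variables into one new variable
united <- data_unite(
  df,
  new_column = "name",
  select = c("first", "last"),
  separator = " "
)

# Split the merged variable back into two variables
separated <- data_separate(
  united,
  select = "name",
  new_columns = c("first", "last"),
  separator = " "
)

# Create a new variable from existing ones
modified <- data_modify(df, name_length = nchar(first) + nchar(last))

# Add the row id as the first column
with_id <- rowid_as_column(df)
```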
MINOR CHANGES
to_numeric() for variables of type Date, POSIXct and POSIXlt now includes the class name in the warning message.
Added a print() method for center(), standardize(), normalize() and rescale().
BUG FIXES
standardize_parameters() now works when the package namespace is in the model formula (#401).
data_merge() no longer yields a warning for tibbles when join = "bind".
center() and standardize() did not work for grouped data frames (of class grouped_df) when force = TRUE.
The data.frame method of describe_distribution() returns NULL instead of an error if no valid variables were passed (for example a factor variable with include_factors = FALSE) (#421).
BREAKING CHANGES
add_labs() was renamed to assign_labels(). Since add_labs() existed only for a few days, there will be no alias for backwards compatibility.
NEW FUNCTIONS
labels_to_levels(), to use value labels of factors as their levels.
MINOR CHANGES
data_read() now checks if the imported object actually is a data frame (or coercible to a data frame), and if not, no longer errors, but gives an informative warning of the type of object that was imported.
BUG FIXES
BREAKING CHANGES
In selection patterns, expressions like -var1:var3 to exclude all variables between var1 and var3 are no longer accepted. The correct expression is -(var1:var3). This is for 2 reasons: consistency with numeric selection (-1:2 is not accepted but -(1:2) is); and consistency with dplyr::select(), which throws a warning and only uses the first variable in the first expression.
NEW FUNCTIONS
recode_into(), similar to dplyr::case_when(), to recode values from one or more variables into a new variable.
mean_sd() and median_mad() for summarizing vectors to their mean (or median) and a range of one SD (or MAD) above and below.
data_write() as counterpart to data_read(), to write data frames into CSV, SPSS, SAS, Stata files and many other file types. One advantage over existing functions to write data in other packages is that labelled (numeric) data can be converted into factors (with value labels used as factor levels) even for text formats like CSV and similar. This allows exporting "labelled" data into those file formats, too.
add_labs(), to manually add value and variable labels as attributes to variables. These attributes are stored as "label" and "labels" attributes, similar to the labelled class from the haven package.
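A minimal sketch of recode_into(), mean_sd() and median_mad() on made-up vectors. The recode conditions are deliberately non-overlapping so that the result does not depend on how overlapping patterns are resolved.

```r
library(datawizard)

x <- 1:10

# Recode a numeric vector into three categories;
# values matching no condition get the default
recoded <- recode_into(
  x <= 3 ~ "low",
  x > 3 & x <= 7 ~ "mid",
  default = "high"
)
table(recoded)

# Mean, plus one SD below and above
mean_sd(c(1, 2, 3))

# Median, plus one MAD below and above
median_mad(c(1, 2, 3))
```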
MINOR CHANGES
data_rename() gets a verbose argument.
winsorize() now errors if the threshold is incorrect (previously, it provided a warning and returned the unchanged data). The argument verbose is now useless but is kept for backward compatibility. The documentation now contains [...] threshold (#357).
[...] select and/or exclude, there is now one warning per misspelled variable. The previous behavior was to have only one warning.
[...] standardize() when only one of the arguments center or scale were provided (#365).
unstandardize() and replace_nan_inf() now work with select helpers (#376).
[...] reverse(). Furthermore, the docs now describe the range argument more clearly (#380).
unnormalize() errors with unexpected inputs (#383).
BUG FIXES
empty_columns() (and therefore remove_empty_columns()) now correctly detects columns containing only NA_character_ (#349).
[...] select (#356).
[...] convert_na_to() when select is a list (#352).
MAJOR CHANGES
MINOR CHANGES
standardize(), center(), normalize() and rescale() can be used in model formulas, similar to base::scale().
data_codebook() now includes the proportion for each category/value, in addition to the counts. Furthermore, if data contains tagged NA values, these are included in the frequency table.
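As a sketch of the model-formula usage mentioned above, the transformation functions can be applied to predictors directly inside the formula. The model itself (mpg regressed on hp and wt from mtcars) is only an illustration.

```r
library(datawizard)

# Transformations applied inside the formula, much like base::scale()
m <- lm(mpg ~ standardize(hp) + center(wt), data = mtcars)

# With both predictors centered, the intercept equals the mean of mpg
coef(m)
mean(mtcars$mpg)
```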
BUG FIXES
center(x) now works correctly when x is a single value and either reference or center is specified (#324).
Fixed issue in data_codebook(), which failed for labelled vectors when values of labels were not in sorted order.
NEW FUNCTIONS
data_codebook(): to generate codebooks of data frames.
New functions to deal with duplicates: data_duplicated() (keeps all duplicates, including the first occurrence) and data_unique() (returns the data, excluding all duplicates except one instance of each, based on the selected method).
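The difference between the two duplicate helpers can be sketched on a small made-up data frame. The choice keep = "first" is just one of the available methods.

```r
library(datawizard)

df <- data.frame(
  id = c(1, 2, 2, 3),
  value = c("a", "b", "c", "d")
)

# All rows whose id occurs more than once, including the first occurrence
duplicates <- data_duplicated(df, select = "id")
duplicates

# One row per id, keeping the first occurrence of each duplicate
deduplicated <- data_unique(df, select = "id", keep = "first")
deduplicated
```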
MINOR CHANGES
.data.frame methods should now preserve custom attributes.
The include_bounds argument in normalize() can now also be a numeric value, defining the limit to the upper and lower bound (i.e. the distance to 1 and 0).
data_filter() now works with grouped data.
BUG FIXES
data_read() no longer prints a message for empty columns when the data actually had no empty columns.
data_to_wide() now drops columns that are not in id_cols (if specified), names_from, or values_from. This is the behaviour observed in tidyr::pivot_wider().
MAJOR CHANGES
There is a new publication about the {datawizard} package: https://joss.theoj.org/papers/10.21105/joss.04684
Fixes failing tests due to changes in R-devel.
data_to_long() and data_to_wide() have had significant performance improvements, sometimes as high as a ten-fold speedup.
MINOR CHANGES
When column names are misspelled, most functions now suggest which existing columns might have been meant.
Miscellaneous performance gains.
convert_to_na() now requires argument na to be of class 'Date' to convert specific dates to NA. For example, convert_to_na(x, na = "2022-10-17") must be changed to convert_to_na(x, na = as.Date("2022-10-17")).
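A small sketch of the convert_to_na() requirement described above, using a made-up vector of dates.

```r
library(datawizard)

x <- as.Date(c("2022-10-17", "2022-10-18", "2022-10-19"))

# na must be a Date object, not a character string
converted <- convert_to_na(x, na = as.Date("2022-10-17"))
converted
```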
BUG FIXES
data_to_long() and data_to_wide() now correctly keep the date format.
Methods for grouped data frames (.grouped_df) no longer support dplyr::group_by() for {dplyr} before version 0.8.0.
empty_columns() and remove_empty_columns() now also remove columns that contain only empty characters. Likewise, empty_rows() and remove_empty_rows() remove observations that consist entirely of missing or empty character values.
data_arrange() now works with data frames that were grouped using data_group() (#274).
data_read() gains a convert_factors argument, to turn off automatic conversion from numeric variables into factors.
[...] {tidyselect} package (#267).
BREAKING CHANGES
The minimum needed R version has been bumped to 3.6.
The following deprecated functions have been removed: data_cut(), data_recode(), data_shift(), data_reverse(), data_rescale(), data_to_factor(), data_to_numeric().
A new text_format() alias is introduced for format_text(); the latter will be removed in the next release.
A new recode_values() alias is introduced for change_code(); the latter will be removed in the next release.
data_merge() now errors if columns specified in by are not in both datasets.
Using negative values in arguments select and exclude now removes the columns from the selection/exclusion. The previous behavior was to start the selection/exclusion from the end of the dataset, which was inconsistent with the use of "-" in other selection methods.
NEW FUNCTIONS
data_peek(): to peek at values and type of variables in a data frame.
coef_var(): to compute the coefficient of variation.
CHANGES
data_filter() will give more informative messages on malformed syntax of the filter argument.
It is now possible to use curly brackets to pass variable names to data_filter(). See the examples section in the documentation of data_filter().
The regex argument was added to functions that use select helpers and did not already have this argument.
Select helpers starts_with(), ends_with(), and contains() now accept several patterns, e.g. starts_with("Sep", "Petal").
Arguments select and exclude that are present in most functions have been improved to work in loops and in custom functions. For example, the following code now works:
foo <- function(data) {
  i <- "Sep"
  find_columns(data, select = starts_with(i))
}
foo(iris)

for (i in c("Sepal", "Sp")) {
  head(iris) |>
    find_columns(select = starts_with(i)) |>
    print()
}
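The select-helper changes listed above (several patterns in one helper, and the regex argument) can be sketched with data_select(). This assumes data_select() is among the functions that accept regex; the object names are illustrative.

```r
library(datawizard)

# Several patterns in one select helper
sepal_or_petal <- data_select(iris, select = starts_with("Sep", "Petal"))
names(sepal_or_petal)

# With regex = TRUE, the pattern in select is treated as a regular expression
sepal_only <- data_select(iris, select = "^Sep", regex = TRUE)
names(sepal_only)
```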
[...] {datawizard} functions.
[...] {poorman} update