SebKrantz collapse Versions

Advanced and Fast Data Transformation in R

v2.0.13

1 month ago

collapse now explicitly supports xts/zoo and units objects and concurrently removes an additional check in the .default method of statistical functions that called the matrix method if is.matrix(x) && !inherits(x, "matrix"). This was a smart solution to account for the fact that xts objects are matrix-based but don't inherit the "matrix" class, so that, without the check, they would wrongly dispatch to the default method. The same is the case for units, but here, my recent more intensive engagement with spatial data convinced me that this should be changed. For one, under the previous heuristic solution, it was not possible to call the default method on a units matrix, e.g., fmean.default(st_distance(points_sf)) called fmean.matrix() and yielded a vector. This should not be the case. Secondly, aggregation, e.g. fmean(st_distance(points_sf)) or fmean(st_distance(points_sf), g = group_vec), yielded a plain numeric object that lost the units class (in line with the general attribute handling principles). Therefore, I have now decided to remove the heuristic check within the default methods, and explicitly support zoo and units objects. For Fast Statistical Functions, the methods are FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...) and FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...). While the behavior for xts/zoo remains the same, the behavior for units is enhanced, as now the class is preserved in aggregations (the .default method preserves attributes except for ts), and it is possible to manually invoke the .default method on a units matrix and obtain an aggregate statistic. This change may impact computations on other matrix-based classes which don't inherit from "matrix". mts does inherit from "matrix", and I am not aware of any other affected classes, but user code like m <- matrix(rnorm(25), 5); class(m) <- "bla"; fmean(m) will now yield a scalar instead of a vector. Such code must be adjusted to either class(m) <- c("bla", "matrix") or fmean.matrix(m). Overall, the change makes collapse behave in a more standard and predictable way, and enhances its support for units objects central in the sf ecosystem.
fquantile() now also preserves the attributes of the input, in line with quantile().
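The dispatch behavior behind this change can be reproduced in base R alone. The following is a minimal sketch, assuming a hypothetical generic my_mean() that stands in for a statistical function such as fmean(); it does not use collapse and is not the package's implementation.

```r
# Hypothetical generic standing in for a statistical function such as fmean();
# base R only, not collapse's internal code.
my_mean <- function(x, ...) UseMethod("my_mean")
my_mean.default <- function(x, ...) mean(unclass(x))      # one aggregate value
my_mean.matrix  <- function(x, ...) colMeans(unclass(x))  # one value per column

m <- matrix(1:6, nrow = 2)
res_plain <- my_mean(m)          # plain matrix: the implicit class reaches the matrix method

class(m) <- "bla"                # matrix-based, but "matrix" is not in the class attribute
still_matrix    <- is.matrix(m)            # TRUE, m still has a dim attribute of length 2
inherits_matrix <- inherits(m, "matrix")   # FALSE
res_bla <- my_mean(m)            # no "bla" or "matrix" method found: default method, a scalar

class(m) <- c("bla", "matrix")   # the adjustment suggested in the text
res_fixed <- my_mean(m)          # matrix method again, one value per column
```

With the heuristic check removed, a classed matrix only reaches the matrix method if "matrix" is part of its class attribute or if the method is called directly.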

v2.0.12

1 month ago

Fixes some issues with signed int overflows inside hash functions and possible protect bugs flagged by RCHK. With few exceptions these fixes are cosmetic to appease the C/C++ code checks on CRAN.

v2.0.11

2 months ago

An article on collapse has been submitted to the Journal of Statistical Software. The preprint is available through arXiv.
Removed magrittr from most documentation examples (using base pipe).
Improved plot.GRP a little bit - on request of JSS editors.

v2.0.10

3 months ago

Fixed a bug in fmatch() when matching integer vectors to factors. This also affected join().
Improved cross-platform compatibility of OpenMP flags. Thanks @kalibera.
Added stub = TRUE argument to the grouped_df methods of Fast Statistical Functions supporting weights, to be able to remove or alter prefixes given to aggregated weights columns if keep.w = TRUE. Globally, users can set set_collapse(stub = FALSE) to disable this prefixing in all statistical functions and operators.

v2.0.9

4 months ago

Added functions na_locf() and na_focb() for fast basic C implementations of last-observation-carried-forward and first-observation-carried-back imputation (optionally by reference). replace_na() now also has a type argument which supports options "locf" and "focb" (default "const"), similar to data.table::nafill. The implementation also supports character data and list-columns (NULL/empty elements). Thanks @BenoitLondon for suggesting (#489). I note that na_locf() exists in some other packages (such as imputeTS) where it is implemented in R and has additional options. Users can use the flexible namespace, i.e. set_collapse(remove = "na_locf"), to deal with such conflicts.
Fixed a bug in weighted quantile estimation (fquantile()) that could lead to wrong/out-of-range estimates in some cases. Thanks @zander-prinsloo for reporting (#523).
Improved right join such that join column names of x instead of y are preserved. This is more consistent with the other joins when join columns in x and y have different names.
More fluent and safe interplay of 'mask' and 'remove' options in set_collapse(): it is now seamlessly possible to switch from any combination of 'mask' and 'remove' to any other combination without the need of setting them to NULL first.
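A short usage sketch for the new missing-value functions in this release, assuming collapse >= 2.0.9 is installed; the input vector is made up for illustration.

```r
library(collapse)  # assumes collapse >= 2.0.9

x <- c(NA, 1, NA, NA, 4, NA)

locf <- na_locf(x)                           # carry the last observed value forward
focb <- na_focb(x)                           # carry the next observed value backward
via_replace <- replace_na(x, type = "locf")  # same operation through replace_na()
```

A leading NA has no earlier observation to carry forward and a trailing NA has no later observation to carry back, so these stay missing.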

v2.0.8

4 months ago

In pivot(..., values = [multiple columns], labels = "new_labels_column", how = "wider"), if the columns selected through values already have variable labels, they are concatenated with the new labels provided through "new_labels_column" using " - " as a separator (similar to names where the separator is "_").
whichv() and operators %==%, %!=% now properly account for missing double values, e.g. c(NA_real_, 1) %==% c(NA_real_, 1) yields c(1, 2) rather than 2. Thanks @eutwt for flagging this (#518).
In setv(X, v, R), if the type of R is higher than that of X, e.g. setv(1:10, 1:3, 9.5), a warning is issued that conversion of R to the lower type (real to integer in this case) may incur loss of information. Thanks @tony-aw for suggesting (#498).
frange() has an option finite = FALSE, like base::range. Thanks @MLopez-Ibanez for suggesting (#511).
varying.pdata.frame(..., any_group = FALSE) now unindexes the result (as should be the case).
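The missing-value handling of %==% and the new finite option of frange() from this release can be tried as follows. This is a small sketch assuming collapse >= 2.0.8; the inputs are illustrative.

```r
library(collapse)  # assumes collapse >= 2.0.8

x <- c(NA_real_, 1)
idx <- x %==% x                  # indices of matching elements; NA now matches NA

v <- c(1, Inf, 3)
rng_all    <- frange(v)                 # default finite = FALSE keeps Inf
rng_finite <- frange(v, finite = TRUE)  # non-finite values are ignored
```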

v2.0.7

5 months ago

Fixed bug in full join if verbose = 0. Thanks @zander-prinsloo for reporting.
Added argument multiple = FALSE to join(). Setting multiple = TRUE performs a multiple-matching join where a row in x is matched to all matching rows in y. The default FALSE just takes the first matching row in y.
Improved recode/replace functions. Notably, replace_outliers() now supports option value = "clip" to replace outliers with the respective upper/lower bounds, and also has option single.limit = "mad" which removes outliers exceeding a certain number of median absolute deviations. Furthermore, all functions now have a set argument which fully applies the transformations by reference.
Functions replace_NA and replace_Inf were renamed to replace_na and replace_inf to make the namespace a bit more consistent. The earlier versions remain available.
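The following sketch illustrates the multiple argument of join() and the "clip" option of replace_outliers() from this release. It assumes collapse >= 2.0.7; the two small data frames and the limits are made up for illustration.

```r
library(collapse)  # assumes collapse >= 2.0.7

x <- data.frame(id = c(1, 2), a = c("x1", "x2"))
y <- data.frame(id = c(1, 1, 2), b = c("y1", "y2", "y3"))

# Default multiple = FALSE: each row of x takes the first matching row of y
first_match <- join(x, y, on = "id", verbose = 0)

# multiple = TRUE: id 1 matches two rows of y, so it appears twice in the result
all_matches <- join(x, y, on = "id", multiple = TRUE, verbose = 0)

# Clip values outside the two-sided limits to the respective bound
clipped <- replace_outliers(c(-5, 1, 2, 3, 50), limits = c(0, 10), value = "clip")
```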

v2.0.6

6 months ago

Fixed a serious bug in qsu() where higher order weighted statistics were erroneous, i.e. whenever qsu(x, ..., w = weights, higher = TRUE) was invoked, the 'SD', 'Skew' and 'Kurt' columns were wrong (if higher = FALSE the weighted 'SD' is correct). The reason is that there appears to be no straightforward generalization of Welford's Online Algorithm to higher-order weighted statistics. This was not detected earlier because the algorithm was only tested with unit weights. The fix involved replacing Welford's Algorithm for the higher-order weighted case by a 2-pass method, that additionally uses long doubles for higher-order terms. Thanks @randrescastaneda for reporting.
Fixed some unexpected behavior in t_list() where names 'V1', 'V2', etc. were assigned to unnamed inner lists. It now preserves the missing names. Thanks @orgadish for flagging this.
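To illustrate what a 2-pass method for higher-order weighted statistics looks like, here is a minimal base R sketch. The helper weighted_moments_2pass() is hypothetical and is not the C code used by qsu(). It assumes the weights are frequencies and uses population-style moments (divisor sum(w)), so its estimators may differ from those reported by qsu().

```r
# Hypothetical helper, not collapse's implementation: two-pass weighted moments
# with w treated as frequency weights and sum(w) as the divisor.
weighted_moments_2pass <- function(x, w) {
  sw <- sum(w)
  mu <- sum(w * x) / sw        # pass 1: weighted mean
  d  <- x - mu                 # pass 2: weighted central moments around that mean
  m2 <- sum(w * d^2) / sw
  m3 <- sum(w * d^3) / sw
  m4 <- sum(w * d^4) / sw
  c(mean = mu, sd = sqrt(m2), skew = m3 / m2^1.5, kurt = m4 / m2^2)
}

x <- c(1, 2, 4, 8)
w <- c(3, 1, 2, 1)

weighted   <- weighted_moments_2pass(x, w)
# Under the frequency interpretation, this must match the data with rows repeated w times
replicated <- weighted_moments_2pass(rep(x, w), rep(1, sum(w)))
```

Comparing against replicated data with non-unit weights is the kind of check that tests with unit weights alone cannot provide.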

v2.0.5

6 months ago

In join(), if y is an expression, e.g. join(x = mtcars, y = subset(mtcars, mpg > 20)), then its name is not extracted but just set to "y". Before, the name of y would be captured as as.character(substitute(y))[1] = "subset" in this case. This is an improvement mainly for display purposes, but could also affect code if there are duplicate columns in both datasets and suffix was not provided in the join call: before, y-columns would be renamed using a (non-sensible) "_subset" suffix, but now using a "_y" suffix. Note that this only concerns cases where y is an expression rather than a single object.
Small performance improvements to %[!]in% operators: %!in% now uses is.na(fmatch(x, table)) rather than fmatch(x, table, 0L) == 0L, and %in%, if exported using set_collapse(mask = "%in%"|"special"|"all") is as.logical(fmatch(x, table, 0L)) instead of fmatch(x, table, 0L) > 0L. The latter are faster because comparison operators >, == with integers additionally need to check for NA's (= the smallest integer in C).
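The equivalence of the old and new formulations of the %[!]in% operators can be checked in base R. This sketch uses base::match() as a stand-in for fmatch(), which is an assumption made so the example runs without collapse; it shows only that the results agree, not the speed difference.

```r
x   <- c(1L, 5L, NA, 3L)
tab <- c(3L, 1L, 2L)

# %!in%: comparison against 0L versus checking for a missing match
not_in_old <- match(x, tab, 0L) == 0L
not_in_new <- is.na(match(x, tab))

# %in%: comparison against 0L versus coercing the match position to logical
in_old <- match(x, tab, 0L) > 0L
in_new <- as.logical(match(x, tab, 0L))
```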

v2.0.4

6 months ago

In fnth()/fquantile(), there has been a slight change to the weighted quantile algorithm. As outlined in the documentation, this algorithm gives weighted versions for all continuous quantile methods (type 7-9) in R by replacing sample quantities with their weighted counterparts. E.g., for the default quantile type 7, the continuous (lower) target element is (n - 1) * p. In the weighted algorithm, this became (sum(w) - mean(w)) * p and was compared to the cumulative sum of ordered (by x) weights, to preserve equivalence of the algorithms in cases where the weights are all equal. However, on second thought, the use of mean(w) does not really reflect a standard interpretation of the weights as frequencies. I have reasoned that using min(w) instead of mean(w) better reflects such an interpretation, as the minimum (non-zero) weight reflects the size of the smallest sampled unit. So the weighted quantile type 7 target is now (sum(w) - min(w)) * p, and the other methods have been adjusted accordingly (note that zero weight observations are ignored in the algorithm).
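The effect of replacing mean(w) by min(w) in the type 7 target can be seen with a little arithmetic. The three helpers below are hypothetical and only compute the target position from the formulas in the text, not the quantile itself; the example avoids zero weights.

```r
# Hypothetical helpers: type 7 target positions only, not a quantile implementation
target_unweighted <- function(n, p) (n - 1) * p
target_mean_w     <- function(w, p) (sum(w) - mean(w)) * p                   # previous target
target_min_w      <- function(w, p) { w <- w[w > 0]; (sum(w) - min(w)) * p } # current target

p <- 0.5

# Equal weights: both weighted targets reduce to the unweighted (n - 1) * p
w_equal   <- rep(1, 5)
eq_plain  <- target_unweighted(5, p)
eq_mean   <- target_mean_w(w_equal, p)
eq_min    <- target_min_w(w_equal, p)

# Unequal weights: the two targets differ
w_unequal <- c(1, 2, 3)
uneq_mean <- target_mean_w(w_unequal, p)  # (6 - 2) * 0.5
uneq_min  <- target_min_w(w_unequal, p)   # (6 - 1) * 0.5
```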
This is more a note than a change to the package: there is an issue with vctrs that users can encounter using collapse together with the tidyverse (especially ggplot2), which is that collapse internally optimizes computations on factors by giving them an additional "na.included" class if they are known to not contain any missing values. For example, pivot(mtcars) gives a "variable" factor which has class c("factor", "na.included"), such that grouping on "variable" in subsequent operations is faster. Unfortunately, pivot(mtcars) |> ggplot(aes(y = value)) + geom_histogram() + facet_wrap( ~ variable) currently gives an error produced by vctrs, because vctrs does not implement a standard S3 method dispatch and thus does not ignore the "na.included" class. It turns out that the only way for me to deal with this would be to swap the order of classes, i.e. c("na.included", "factor"), import vctrs, and implement vec_ptype2 and vec_cast methods for "na.included" objects. This will never happen, as collapse is and will remain independent of the tidyverse. There are two ways you can deal with this: The first way is to remove the "na.included" class for ggplot2, e.g. facet_wrap( ~ set_class(variable, "factor")) or facet_wrap( ~ factor(variable)) will both work. The second option is to define a function vec_ptype2.factor.factor <- function(x, y, ...) x in your global environment, which avoids vctrs performing extra checks on factor objects.
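The first workaround relies on the extra class being dropped before the factor reaches ggplot2. The base R sketch below shows this on a hand-built factor that mimics the class vector described above. It uses oldClass<- as a base stand-in for set_class() and does not load collapse, ggplot2 or vctrs, so it only demonstrates the class handling, not the plotting.

```r
# Hand-built factor mimicking the class vector collapse assigns (illustrative data)
f <- structure(factor(c("a", "b", "a")), class = c("factor", "na.included"))

# Option A: rebuilding the factor yields a plain "factor"
plain_a <- factor(f)

# Option B: overwrite the class attribute (base stand-in for set_class(f, "factor"))
plain_b <- f
oldClass(plain_b) <- "factor"
```

Both variants keep the levels and values and differ from the input only in the class attribute.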