Bring polars to R
- In `$join()`, there is a new argument `coalesce`, and the `how` options now accept `"full"` instead of `"outer"` and `"outer_coalesce"`.
- `$top_k()` and `$bottom_k()` gain three arguments: `nulls_last`, `maintain_order` and `multithreaded`.
- `$rolling_*()` functions lose the arguments `by`, `closed` and `warn_if_unsorted`. Rolling computations based on `by` must be made via the corresponding `rolling_*_by()`, e.g. `rolling_mean_by()` instead of `rolling_mean(by =)` (#1115).
- `pl$scan_parquet()` and `pl$read_parquet()` gain an argument `glob` which defaults to `TRUE`. Set it to `FALSE` to avoid treating `*` as a globbing pattern.
- `$is_not_nan()` on a `null` value (`NA` in R) now returns `null`. Previously, it returned `TRUE`.
- In `$reshape()`, argument `dims` is renamed `dimensions` and there is a new argument `nested_type` specifying whether the output should be of type List or Array.
- In `$value_counts()`, all arguments must be named and there is a new argument `name` to specify the name of the output.
- Alongside the existing optimization parameters (e.g. `projection_pushdown`), there is a new parameter `cluster_with_columns` to combine sequential independent calls to `$with_columns()`.
- `$str$explode()` is removed.
- The `check_sorted` argument is removed from `$rolling()` and `$group_by_dynamic()`. Sortedness is now verified in a quick manner, so this argument is no longer needed (pola-rs/polars#16494).
- `$name$map()` stacks on Linux, so this method is deprecated and its documentation is removed. Please use other methods like `<LazyFrame>$rename(<function>)` instead (#1123).
- The order of arguments in `pl$Series` is changed (#1071). The first argument is now `name`, and the second argument is `values`.
- `$to_struct()` on an Expr is removed. This method is now only available for `Series`, `DataFrame`, and in the `$list` and `$arr` subnamespaces. For example, `pl$col("a", "b", "c")$to_struct()` should be replaced with `pl$struct(c("a", "b", "c"))` (#1092).
- `pl$Struct()` now only accepts named inputs and objects of class `RPolarsField`. For example, `pl$Struct(pl$Boolean)` no longer works and should be named, as in `pl$Struct(a = pl$Boolean)` (#1053).
- In `$all()` and `$any()`, the argument `drop_nulls` is renamed `ignore_nulls`, and this argument must be named (#1050).
- `$struct$with_fields()` (#1109) and new function `pl$field()` to be used in expressions in `$struct$with_fields()` (#1113).
- `RPolarsDataType`: `$is_enum()`, `$is_categorical()`, `$is_known()`, `$is_string()`, `$contains_views()`, `$contains_categorical()` (#1112).
- In `$dt$combine()`, the arguments `tm` and `tu` are renamed `time` and `time_unit` (#1116).
- The default value of the `rechunk` argument of `pl$concat()` is changed from `TRUE` to `FALSE` (#1125).
- In `$rename()` for LazyFrame and DataFrame, key-value pairs of names are changed to `old_name = "new_name"` instead of `new_name = "old_name"` (#1129).
- In `$rename()` for LazyFrame and DataFrame, calling the method with no arguments is no longer allowed (#1129).
- In `$rolling_*()` functions, the arguments `center` and `ddof` must be named (#1115).
- A function can be passed to `$rename()` for LazyFrame and DataFrame. This is equivalent to `polars.LazyFrame.rename(mapping: Callable[[str], str])` or `polars.DataFrame.rename(mapping: Callable[[str], str])` in Python Polars (#1122, #1129).

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.4...v0.17.0
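The `$rename()` changes in this release are easy to get backwards, so here is a minimal sketch of both call styles. It assumes polars 0.17.0 or later is installed; the data frame and the column names `a`, `b` and `id` are made up for illustration.

```r
library(polars)

# Hypothetical data, used only for illustration.
df <- pl$DataFrame(a = 1:3, b = c("x", "y", "z"))

# From 0.17.0: the existing name is the key, the new name is the value.
renamed <- df$rename(a = "id")
renamed$columns

# A function mapping old names to new names is also accepted.
upper <- df$rename(toupper)
upper$columns
```

Code written for 0.16.x with the old `new_name = "old_name"` form needs its pairs swapped when upgrading.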
- `pl$read_ipc()` can read a raw vector in the Apache Arrow IPC file format (#1072).
- `<DataFrame>$to_raw_ipc()` to serialize a DataFrame to a raw vector in the Apache Arrow IPC file format (#1072).
- `<LazyFrame>$serialize()` to serialize a LazyFrame to a character vector of JSON representation (#1073).
- `pl$deserialize_lf()` to deserialize a LazyFrame from a character vector of JSON representation (#1073).
- `$str$head()` and `$str$tail()` (#1074).
- `nanoarrow::as_nanoarrow_array_stream()` and `nanoarrow::infer_nanoarrow_schema()` for `RPolarsSeries` (#1076).
- `$dt$is_leap_year()` (#1077).
- `as_polars_df()` and `as_polars_series()` support `arrow::RecordBatchReader` (#1078).
- `experimental` argument for `as_polars_df(<ArrowTabular>)`, `as_polars_df(<RecordBatchReader>)`, `as_polars_series(<nanoarrow_array_stream>)`, and `as_polars_df(<nanoarrow_array_stream>)` (#1078). If `experimental = TRUE`, these functions switch to using the Arrow C stream interface internally. At this point, performance is degraded under the expected use cases, so the default is `experimental = FALSE`.

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.3...v0.16.4
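As a rough illustration of the serialization features added in 0.16.4, the sketch below round-trips a DataFrame through a raw Arrow IPC vector and a LazyFrame through its JSON representation. It assumes polars 0.16.4 or later; the column names and the filter are invented for the example.

```r
library(polars)

# Hypothetical data, used only for illustration.
df <- pl$DataFrame(x = 1:3, y = c(1.5, 2.5, 3.5))

# DataFrame -> raw vector (Arrow IPC file format) -> DataFrame
raw_ipc <- df$to_raw_ipc()
restored <- pl$read_ipc(raw_ipc)

# LazyFrame -> JSON (character vector) -> LazyFrame -> DataFrame
lf <- df$lazy()$filter(pl$col("x") > 1)
json <- lf$serialize()
rebuilt <- pl$deserialize_lf(json)$collect()
```

The raw vector can be written to a connection or sent over a socket like any other R raw vector.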
- `<SQLContext>$register_globals()` (#1064).
- `$sql()` for DataFrame and LazyFrame (#1065).

https://rpolars.github.io/
https://pola-rs.github.io/r-polars/

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.2...v0.16.3
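A small sketch of `$sql()` on a DataFrame follows. It assumes polars 0.16.3 or later, built with the SQL feature enabled, and that the calling frame is available in the query under the table name `self`. The data is hypothetical.

```r
library(polars)

# Hypothetical data, used only for illustration.
df <- pl$DataFrame(a = 1:5, b = c("p", "q", "r", "s", "t"))

# The calling frame is referenced as `self` inside the query (assumption
# stated in the lead-in).
res <- df$sql("SELECT a, b FROM self WHERE a > 3")
res
```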
- `$cut()` and `$qcut()` to bin continuous values into discrete categories (#1057).
- `pl$scan_parquet()` and `pl$read_parquet()` can read data from the internet when a URL is passed as the first argument (#1056, @andyquinterom).
- `pl$scan_parquet()` and `pl$read_parquet()` gain an argument `storage_options` to scan/read data via cloud storage providers (GCP, AWS, Azure). Note that this support is experimental (#1056, @andyquinterom).
- `Enum` datatype via `pl$Enum()` (#1061).

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.1...v0.16.2
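To make the binning behaviour of `$cut()` concrete, here is a minimal sketch assuming polars 0.16.2 or later. The break points and labels are arbitrary choices for the example. With the default right-closed intervals, two breaks produce three bins.

```r
library(polars)

# Hypothetical data, used only for illustration.
df <- pl$DataFrame(x = c(-2, 0, 2))

# Two breaks give three bins: (-inf, -1], (-1, 1], (1, inf).
binned <- df$with_columns(
  bin = pl$col("x")$cut(breaks = c(-1, 1), labels = c("low", "mid", "high"))
)
binned
```

`$qcut()` works the same way but takes quantiles instead of explicit break points.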
This is a small hot-fix release to update dependent Rust polars to 0.39.1 (#1042).
Also, there are some updates.
- `$len()` now correctly includes `null` values in the count (#1044).
- `$arr$max()` and `$arr$min()` work without the `nightly` feature (#1042).

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.0...v0.16.1
Rust polars is updated to 0.39.0 (#937, #1034).
R objects inside an R list are now converted to Polars data types via `as_polars_series()` (#1021, #1022, #1023). For example, up to polars 0.15.1, a list containing a data.frame with a column of {clock} naive-time class was converted to a nested List type of Float64:
    data = data.frame(time = clock::naive_time_parse("1990-01-01", precision = "day"))
    pl$select(
      nested_data = pl$lit(list(data))
    )
    #> shape: (1, 1)
    #> ┌──────────────────────────┐
    #> │ nested_data              │
    #> │ ---                      │
    #> │ list[list[list[f64]]]    │
    #> ╞══════════════════════════╡
    #> │ [[[2.1475e9], [7305.0]]] │
    #> └──────────────────────────┘
From 0.16.0, nested types are converted correctly, so the same input becomes a List of Struct containing a Datetime:
    data = data.frame(time = clock::naive_time_parse("1990-01-01", precision = "day"))
    pl$select(
      nested_data = pl$lit(list(data))
    )
    #> shape: (1, 1)
    #> ┌─────────────────────────┐
    #> │ nested_data             │
    #> │ ---                     │
    #> │ list[struct[1]]         │
    #> ╞═════════════════════════╡
    #> │ [{1990-01-01 00:00:00}] │
    #> └─────────────────────────┘
Several functions have been rewritten to match the behavior of Python Polars. There are four types of changes: i) change in argument names, ii) change in the way arguments are passed (named or by position), iii) arguments are removed, and iv) change in the default and accepted values. Those are addressed separately below.
Change in argument names:
- In `$reshape()`, the `dims` argument is renamed to `dimensions` (#1019).
- In `pl$read_*` and `pl$scan_*` functions, the first argument is now `source` (#935).
- In `pl$Series()`, the argument `x` is renamed `values` (#933).
- In `<DataFrame>$write_*` functions, the first argument is now `file` (#935).
- In `<LazyFrame>$sink_*` functions, the first argument is now `path` (#935).
- In `<LazyFrame>$sink_ipc()`, the argument `memmap` is renamed to `memory_map` (#1032).
- In `<DataFrame>$rolling()`, `<LazyFrame>$rolling()`, `<DataFrame>$group_by_dynamic()` and `<LazyFrame>$group_by_dynamic()`, the `by` argument is renamed to `group_by` (#983).
- In `$dt$convert_time_zone()` and `$dt$replace_time_zone()`, the `tz` argument is renamed to `time_zone` (#944).
- In `$str$strptime()`, the argument `datatype` is renamed to `dtype` (#939).
- In `$str$to_integer()` (renamed from `$str$parse_int()`), argument `radix` is renamed to `base` (#1038).

Change in the way arguments are passed:
- In all input/output functions, all arguments except the first argument must be named arguments (#935).
- In `<DataFrame>$rolling()` and `<DataFrame>$group_by_dynamic()`, all arguments except `index_column` must be named arguments (#983).
- In `$unique()` for `DataFrame` and `LazyFrame`, arguments `keep` and `maintain_order` must be named (#953).
- In `$bin$decode()`, the `strict` argument must be a named argument (#980).
- In `$dt$replace_time_zone()`, all arguments except `time_zone` must be named arguments (#944).
- In `$str$contains()`, the arguments `literal` and `strict` must be named (#982).
- In `$str$contains_any()`, the `ascii_case_insensitive` argument must be named (#986).
- In `$str$count_matches()`, `$str$replace()` and `$str$replace_all()`, the `literal` argument must be named (#987).
- In `$str$strptime()`, `$str$to_date()`, `$str$to_datetime()`, and `$str$to_time()`, all arguments (except the first one) must be named (#939).
- In `$str$to_integer()` (renamed from `$str$parse_int()`), all arguments must be named (#1038).
- In `pl$date_range()`, the arguments `closed`, `time_unit`, and `time_zone` must be named (#950).
- In `$set_sorted()` and `$sort_by()`, argument `descending` must be named (#1034).
- In `pl$Series()`, using positional arguments throws a warning, since the argument positions will be changed in the future (#966).
    # polars 0.15.1 or earlier
    # The first argument is `x`, the second argument is `name`.
    pl$Series(1:3, "foo")

    # The code above will warn in 0.16.0
    # Use named arguments to silence the warning.
    pl$Series(values = 1:3, name = "foo")
    pl$Series(name = "foo", values = 1:3)

    # polars 0.17.0 or later (future version)
    # The first argument is `name`, the second argument is `values`.
    pl$Series("foo", 1:3)
This warning can also be silenced by replacing `pl$Series(<values>, <name>)` with `as_polars_series(<values>, <name>)`.
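The named-argument requirement applies to many string methods as well. The sketch below, assuming polars 0.16.0 or later, shows two of the calls listed above written in the required form. The column names and values are hypothetical.

```r
library(polars)

# Hypothetical data, used only for illustration.
df <- pl$DataFrame(s = c("a.b", "abc"), bits = c("101", "11"))

out <- df$select(
  # `literal` must be passed by name; "." is matched literally, not as a regex.
  has_dot = pl$col("s")$str$contains(".", literal = TRUE),
  # Every argument of `$str$to_integer()` must be named; base 2 parses binary.
  value = pl$col("bits")$str$to_integer(base = 2L)
)

res <- as.data.frame(out)
res
```

Passing `base` explicitly also guards against the default changing from `2` to `10` in this release.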
Arguments removed:
- `columns` in `$drop()` is removed. `$drop()` now accepts several character scalars, such as `$drop("a", "b", "c")` (#912).
- In `pl$col()`, the `name` argument is removed, and the `...` argument no longer accepts a list of characters and `RPolarsSeries` class objects (#923).
- In `pl$date_range()`, the unused argument `explode` (not working in recent versions) is removed (#950).

Change in arguments default and accepted values:
- In `pl$Series()`, the argument `values` has a new default value `NULL` (#966).
- In `$unique()` for `DataFrame` and `LazyFrame`, argument `keep` has a new default value `"any"` (#953).
- In rolling functions (e.g. `$rolling_mean()`), the default value of argument `closed` is now `NULL`. Using `closed` with a fixed `window_size` now throws an error (#937).
- In `pl$date_range()`, the argument `end` must be specified and the default value of `interval` is changed to `"1d"`. The arguments `start` and `end` no longer accept numeric values (#950).
- In `pl$scan_parquet()`, the default value of the argument `rechunk` is changed from `TRUE` to `FALSE` (#1033).
- In `pl$scan_parquet()` and `pl$read_parquet()`, the argument `parallel` only accepts `"auto"`, `"columns"`, `"row_groups"`, and `"none"`. Previously, it also accepted upper-case notation of `"auto"`, `"columns"`, `"none"`, and `"RowGroups"` instead of `"row_groups"` (#1033).
- In `$str$to_integer()` (renamed from `$str$parse_int()`), the default value of `base` is changed from `2` to `10` (#1038).

The usage of `pl$date_range()` to create a range of `Datetime` data type is deprecated. `pl$date_range()` will always create a range of `Date` data type in the future. Use `pl$datetime_range()` if you want to create a range of `Datetime` instead (#950).
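As a sketch of the split between the two range functions, the example below builds one `Date` range and one `Datetime` range. It assumes polars 0.16.0 or later; the dates and intervals are arbitrary.

```r
library(polars)

# Date range: `start` and `end` are Date objects, `interval` is given by name.
dates <- pl$select(
  d = pl$date_range(as.Date("2024-01-01"), as.Date("2024-01-03"), interval = "1d")
)

# Datetime range: use pl$datetime_range() rather than pl$date_range().
datetimes <- pl$select(
  dt = pl$datetime_range(
    as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
    as.POSIXct("2024-01-02 00:00:00", tz = "UTC"),
    interval = "12h"
  )
)
```

Both calls return expressions, so they are evaluated here inside `pl$select()`.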
- `<DataFrame>$get_columns()` now returns an unnamed list instead of a named list (#991).
- Removed `$argsort()` which was an old alias for `$arg_sort()` (#930).
- Removed `pl$expr_to_r()` which was an alias for `$to_r()` (#938).
- `<Series>$to_r_list()` is renamed `<Series>$to_list()` (#938).
- Removed `<Series>$to_r_vector()` which was an old alias for `<Series>$to_vector()` (#938).
- Removed `<Expr>$rep_extend()`, which was an experimental method created at the early stage of this package and does not exist in other language APIs (#1028).
- The following deprecated functions are now removed: `pl$threadpool_size()`, `<DataFrame>$with_row_count()`, `<LazyFrame>$with_row_count()` (#965).
- In `$group_by_dynamic()`, the first datapoint is always preserved (#1034).
- `$str$parse_int()` is renamed to `$str$to_integer()` (#1038).
New functions:
- `pl$arg_sort_by()` (#929).
- `pl$arg_where()` to get the indices that match a condition (#922).
- `pl$datetime()`, `pl$date()`, and `pl$time()` to easily create Expr of class datetime, date, and time via columns and literals (#918).
- `pl$datetime_range()`, `pl$date_ranges()` and `pl$datetime_ranges()` (#950, #962).
- `pl$int_range()` and `pl$int_ranges()` (#968).
- `pl$mean_horizontal()` (#959).
- `pl$read_ipc()` (#1033).
- `is_polars_dtype()` (#927).

New methods:
- `<LazyFrame>$to_dot()` to print the query plan of a LazyFrame with graphviz dot syntax (#928).
- `$clear()` for `DataFrame`, `LazyFrame`, and `Series` (#1004).
- `$item()` for `DataFrame` and `Series` (#992).
- `$select_seq()` and `$with_columns_seq()` for `DataFrame` and `LazyFrame` (#1003).
- `$arr$to_list()` (#1018).
- `$str$extract_groups()` (#979).
- `$str$find()` (#985).
- `<DataFrame>$write_ipc()` (#1032).
- `RPolarsDataType` gains several methods to check the datatype, such as `$is_integer()`, `$is_null()` or `$is_list()` (#1036).

New arguments or argument values:
- `ambiguous` can now take the value `"null"` to convert ambiguous datetimes to null values (#937).
- `n` in `$str$replace()` (#987).
- `non_existent` in `$dt$replace_time_zone()` to specify what should happen when a datetime doesn't exist.
- `mapping_strategy` in `$over()` (#984, #988).
- `raise_if_undetermined` in `$meta$output_name()` (#961).
- `null_on_oob` in `$arr$get()` and `$list$get()` to determine what happens when the index is out of bounds (#1034).
- `nulls_last`, `multithreaded`, and `maintain_order` in `$sort_by()` (#1034).

Other:
- `pl$Series()` now calls `as_polars_series()` internally, so it can convert more classes to Series properly (#1015).
- `Duration` datatype (#955).
- `<Series>$struct$fields` (#1002).
- `$write_*()` and `$sink_*()` functions now invisibly return the input data (#1039).
- The `join_nulls` and `validate` arguments of `<DataFrame>$join()` now work correctly (#945).
- The `row_count_*` args in I/O functions were renamed `row_index_*`, but this change was not made for CSV and IPC functions. This renaming is now made (#964).
- `Series` methods from `Expr` inside functions now works correctly (#973). Thanks @Yunuuuu for the report.
- `extendr-api` is updated to the 2024-03-31 unreleased version (#995). The issue where the R session crashed when a panic occurred on the Rust side is resolved. Thanks @CGMossa for the upstream fix.
- The `parallel` argument of `pl$scan_parquet()` and `pl$read_parquet()` now works correctly (#1033). Previously, any correct value was treated as `"auto"`.

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.15.1...v0.16.0