R Polars Versions

Bring polars to R

v0.17.0

1 week ago

Breaking changes

  • Updated rust-polars to unreleased version (> 0.40.0) (#1104, #1110, #1117, #1124):
    • In $join(), there is a new argument coalesce and the how options now accept "full" instead of "outer" and "outer_coalesce".
    • $top_k() and $bottom_k() gain three arguments nulls_last, maintain_order and multithreaded.
    • All $rolling_*() functions lose the arguments by, closed and warn_if_unsorted. Rolling computations based on by must now use the corresponding rolling_*_by() function, e.g. rolling_mean_by() instead of rolling_mean(by =) (#1115).
    • pl$scan_parquet() and pl$read_parquet() gain an argument glob which defaults to TRUE. Set it to FALSE to avoid treating * as a globbing pattern.
    • $is_not_nan() on a null value (NA in R) now returns null. Previously, it returned TRUE.
    • In $reshape(), argument dims is renamed dimensions and there is a new argument nested_type specifying if the output should be of type List or Array.
    • In $value_counts(), all arguments must be named and there is a new argument name to specify the name of the output.
    • In all functions accepting optimization parameters (such as projection_pushdown), there is a new parameter cluster_with_columns to combine sequential independent calls to $with_columns().
    • $str$explode() is removed.
    • The check_sorted argument is removed from $rolling() and $group_by_dynamic(). Sortedness is now verified in a quick manner, so this argument is no longer needed (pola-rs/polars#16494).
    • $name$map() stacks on Linux, so this method is deprecated and its documentation has been removed. Please use other methods like <LazyFrame>$rename(<function>) instead (#1123).
  • As warned in v0.16.0, the order of arguments in pl$Series is changed (#1071). The first argument is now name, and the second argument is values.
  • $to_struct() on an Expr is removed. This method is now only available for Series, DataFrame, and in the $list and $arr subnamespaces. For example, pl$col("a", "b", "c")$to_struct() should be replaced with pl$struct(c("a", "b", "c")) (#1092).
  • pl$Struct() now only accepts named inputs and objects of class RPolarsField. For example, pl$Struct(pl$Boolean) doesn't work anymore and should be named like pl$Struct(a = pl$Boolean) (#1053).
  • In $all() and $any(), the argument drop_nulls is renamed ignore_nulls, and this argument must be named (#1050).
  • New method $struct$with_fields() (#1109) and new function pl$field() to be used in expressions in $struct$with_fields() (#1113).
  • New methods for RPolarsDataType: $is_enum(), $is_categorical(), $is_known(), $is_string(), $contains_views(), $contains_categorical() (#1112).
  • In $dt$combine(), the arguments tm and tu are renamed time and time_unit (#1116).
  • The default value of the rechunk argument of pl$concat() is changed from TRUE to FALSE (#1125).
  • In $rename() for LazyFrame and DataFrame, key-value pairs of names are changed to old_name = "new_name" instead of new_name = "old_name" (#1129).
  • In $rename() for LazyFrame and DataFrame, calling the method with no arguments is no longer allowed (#1129).
  • In all $rolling_*() functions, the arguments center and ddof must be named (#1115).
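
Three of the breaking changes above ($join() with how = "full", the new key-value order in $rename(), and the $rolling_*_by() variants) can be sketched as follows. This is a minimal illustration assuming polars 0.17.0 or later; the data frames and column names are made up for the example.

```r
library(polars)

# Hypothetical example data.
left = pl$DataFrame(key = 1:3, x = c("a", "b", "c"))
right = pl$DataFrame(key = 2:4, y = c("p", "q", "r"))

# `how = "full"` replaces "outer"; `coalesce = TRUE` merges the key columns,
# which is what "outer_coalesce" used to request.
joined = left$join(right, on = "key", how = "full", coalesce = TRUE)

# `$rename()` now takes old_name = "new_name" pairs.
renamed = joined$rename(x = "left_value", y = "right_value")

# Rolling computations over a `by` column use the `*_by()` variants.
ts = pl$DataFrame(
  date = as.Date("2024-01-01") + 0:4,
  value = c(1, 2, 3, 4, 5)
)
rolled = ts$with_columns(
  mean_2d = pl$col("value")$rolling_mean_by("date", window_size = "2d")
)
```

Code written for 0.16.x that used how = "outer", new_name = "old_name" pairs, or rolling_mean(by =) needs to be updated accordingly.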

New features

  • $rename() for LazyFrame and DataFrame now accepts a function. This is equivalent to polars.LazyFrame.rename(mapping: Callable[[str], str]) or polars.DataFrame.rename(mapping: Callable[[str], str]) in Python Polars (#1122, #1129).
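
As a rough sketch of the function form of $rename(), assuming polars 0.17.0 or later and a made-up data frame, any function that maps the existing column names to new names should work:

```r
library(polars)

# Hypothetical example data.
df = pl$DataFrame(foo = 1:3, bar = c("a", "b", "c"))

# Pass an existing function...
upper = df$rename(toupper)

# ...or an anonymous function that builds the new names.
prefixed = df$rename(function(column_name) paste0("col_", column_name))
```

The same calls apply to a LazyFrame, e.g. df$lazy()$rename(toupper)$collect().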

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.4...v0.17.0

lib-v0.40.0

1 week ago

v0.16.4

1 month ago

New features

  • pl$read_ipc() can read a raw vector of Apache Arrow IPC file (#1072).
  • New method <DataFrame>$to_raw_ipc() to serialize a DataFrame to a raw vector of Apache Arrow IPC file format (#1072).
  • New method <LazyFrame>$serialize() to serialize a LazyFrame to a character vector of JSON representation (#1073).
  • New function pl$deserialize_lf() to deserialize a LazyFrame from a character vector of JSON representation (#1073).
  • New methods $str$head() and $str$tail() (#1074).
  • New S3 methods nanoarrow::as_nanoarrow_array_stream() and nanoarrow::infer_nanoarrow_schema() for RPolarsSeries (#1076).
  • New method $dt$is_leap_year() (#1077).
  • as_polars_df() and as_polars_series() support arrow::RecordBatchReader (#1078).
  • New argument experimental for as_polars_df(<ArrowTabular>), as_polars_df(<RecordBatchReader>), as_polars_series(<nanoarrow_array_stream>), and as_polars_df(<nanoarrow_array_stream>) (#1078). If experimental = TRUE, these functions use the Arrow C stream interface internally. At this point, this path is slower in the expected use cases, so the default is experimental = FALSE.
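
The serialization features above can be combined into simple round trips. The following is a minimal sketch assuming polars 0.16.4 or later, with made-up data:

```r
library(polars)

# Hypothetical example data.
df = pl$DataFrame(a = 1:3, b = c("x", "y", "z"))

# DataFrame -> raw vector (Arrow IPC file format) -> DataFrame
raw_ipc = df$to_raw_ipc()
restored = pl$read_ipc(raw_ipc)

# LazyFrame -> JSON string -> LazyFrame
lf = df$lazy()$filter(pl$col("a") > 1)
json = lf$serialize()
collected = pl$deserialize_lf(json)$collect()
```

Both raw_ipc and json are ordinary R vectors, so they can be stored or sent between R sessions with the usual tools.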

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.3...v0.16.4

lib-v0.39.3

1 month ago

v0.16.3

1 month ago

New features

  • New method <SQLContext>$register_globals() (#1064).
  • New experimental method $sql() for DataFrame and LazyFrame (#1065).
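
A minimal sketch of the experimental $sql() method, assuming polars 0.16.3 or later built with SQL support, and assuming the calling frame is available in the query under the name self (the data is made up):

```r
library(polars)

# Hypothetical example data.
df = pl$DataFrame(a = 1:5, b = c("x", "y", "x", "y", "x"))

# Query the calling DataFrame, referred to as `self`.
result = df$sql("SELECT a, b FROM self WHERE a > 2")
```

Since the method is marked experimental, its behavior may change in later releases.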

Miscellaneous

  • The API documentation website has moved to a new location (#1067, #1068). The old website now redirects to the top page of the new website.
    • Old URL: https://rpolars.github.io/
    • New URL: https://pola-rs.github.io/r-polars/

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.2...v0.16.3

v0.16.2

1 month ago

New features

  • $cut() and $qcut() to bin continuous values into discrete categories (#1057).
  • pl$scan_parquet() and pl$read_parquet() can read data from the internet by specifying a URL to the first argument (#1056, @andyquinterom).
  • pl$scan_parquet() and pl$read_parquet() gain an argument storage_options to scan/read data via cloud storage providers (GCP, AWS, Azure). Note that this support is experimental (#1056, @andyquinterom).
  • Add support for the Enum datatype via pl$Enum() (#1061).
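
The binning and Enum features above might be used as follows. This is a sketch assuming polars 0.16.2 or later; the data, break points, labels, and categories are invented for illustration.

```r
library(polars)

# Hypothetical example data.
df = pl$DataFrame(x = c(1, 3, 5, 7))

# Bin continuous values into three labelled categories using two break points.
binned = df$with_columns(
  bin = pl$col("x")$cut(breaks = c(2, 6), labels = c("low", "mid", "high"))
)

# Cast strings to an Enum with a fixed, ordered set of categories.
sizes = as_polars_series(c("small", "large", "small"), name = "size")$
  cast(pl$Enum(c("small", "medium", "large")))
```

$qcut() works along the same lines but takes quantiles instead of explicit break points.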

Bug fixes

  • In some read/scan functions, downloading files could fail if the URL was too long. This is now fixed (#1049, @DyfanJones).

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.1...v0.16.2

lib-v0.39.2

1 month ago

v0.16.1

1 month ago

This is a small hot-fix release to update dependent Rust polars to 0.39.1 (#1042).

It also includes the following updates.

Bug fixes

  • $len() now correctly includes null values in the count (#1044).

Other improvements

  • $arr$max() and $arr$min() work without the nightly feature (#1042).

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.16.0...v0.16.1

lib-v0.39.1

1 month ago

v0.16.0

1 month ago

Breaking changes

  • Rust polars is updated to 0.39.0 (#937, #1034).

  • R objects inside an R list are now converted to Polars data types via as_polars_series() (#1021, #1022, #1023). For example, up to polars 0.15.1, a list containing a data.frame with a column of {clock} naive-time class was converted to a nested List type of Float64:

    data = data.frame(time = clock::naive_time_parse("1990-01-01", precision = "day"))
    pl$select(
      nested_data = pl$lit(list(data))
    )
    #> shape: (1, 1)
    #> ┌──────────────────────────┐
    #> │ nested_data              │
    #> │ ---                      │
    #> │ list[list[list[f64]]]    │
    #> ╞══════════════════════════╡
    #> │ [[[2.1475e9], [7305.0]]] │
    #> └──────────────────────────┘
    

    From 0.16.0, nested types are correctly converted, so the result is a List type of Struct type containing a Datetime type.

    data = data.frame(time = clock::naive_time_parse("1990-01-01", precision = "day"))
    pl$select(
      nested_data = pl$lit(list(data))
    )
    #> shape: (1, 1)
    #> ┌─────────────────────────┐
    #> │ nested_data             │
    #> │ ---                     │
    #> │ list[struct[1]]         │
    #> ╞═════════════════════════╡
    #> │ [{1990-01-01 00:00:00}] │
    #> └─────────────────────────┘
    
  • Several functions have been rewritten to match the behavior of Python Polars. There are four types of changes: i) change in argument names, ii) change in the way arguments are passed (named or by position), iii) arguments are removed, and iv) change in the default and accepted values. Those are addressed separately below.

    1. Change in argument names:

      • In $reshape(), the dims argument is renamed to dimensions (#1019).
      • In pl$read_* and pl$scan_* functions, the first argument is now source (#935).
      • In pl$Series(), the argument x is renamed values (#933).
      • In <DataFrame>$write_* functions, the first argument is now file (#935).
      • In <LazyFrame>$sink_* functions, the first argument is now path (#935).
      • In <LazyFrame>$sink_ipc(), the argument memmap is renamed to memory_map (#1032).
      • In <DataFrame>$rolling(), <LazyFrame>$rolling(), <DataFrame>$group_by_dynamic() and <LazyFrame>$group_by_dynamic(), the by argument is renamed to group_by (#983).
      • In $dt$convert_time_zone() and $dt$replace_time_zone(), the tz argument is renamed to time_zone (#944).
      • In $str$strptime(), the argument datatype is renamed to dtype (#939).
      • In $str$to_integer() (renamed from $str$parse_int()), argument radix is renamed to base (#1038).
    2. Change in the way arguments are passed:

      • In all input/output functions, all arguments except the first argument must be named arguments (#935).

      • In <DataFrame>$rolling() and <DataFrame>$group_by_dynamic(), all arguments except index_column must be named arguments (#983).

      • In $unique() for DataFrame and LazyFrame, arguments keep and maintain_order must be named (#953).

      • In $bin$decode(), the strict argument must be a named argument (#980).

      • In $dt$replace_time_zone(), all arguments except time_zone must be named arguments (#944).

      • In $str$contains(), the arguments literal and strict must be named (#982).

      • In $str$contains_any(), the ascii_case_insensitive argument must be named (#986).

      • In $str$count_matches(), $str$replace() and $str$replace_all(), the literal argument must be named (#987).

      • In $str$strptime(), $str$to_date(), $str$to_datetime(), and $str$to_time(), all arguments (except the first one) must be named (#939).

      • In $str$to_integer() (renamed from $str$parse_int()), all arguments must be named (#1038).

      • In pl$date_range(), the arguments closed, time_unit, and time_zone must be named (#950).

      • In $set_sorted() and $sort_by(), argument descending must be named (#1034).

      • In pl$Series(), using positional arguments throws a warning, since the argument positions will be changed in the future (#966).

        # polars 0.15.1 or earlier
        # The first argument is `x`, the second argument is `name`.
        pl$Series(1:3, "foo")
        
        # The code above will warn in 0.16.0
        # Use named arguments to silence the warning.
        pl$Series(values = 1:3, name = "foo")
        pl$Series(name = "foo", values = 1:3)
        
        # polars 0.17.0 or later (future version)
        # The first argument is `name`, the second argument is `values`.
        pl$Series("foo", 1:3)
        

        This warning can also be silenced by replacing pl$Series(<values>, <name>) by as_polars_series(<values>, <name>).

    3. Arguments removed:

      • The argument columns in $drop() is removed. $drop() now accepts several character scalars, such as $drop("a", "b", "c") (#912).
      • In pl$col(), the name argument is removed, and the ... argument no longer accepts a list of characters and RPolarsSeries class objects (#923).
      • In pl$date_range(), the unused argument explode (not working in recent versions) is removed (#950).
    4. Change in arguments default and accepted values:

      • In pl$Series(), the argument values has a new default value NULL (#966).
      • In $unique() for DataFrame and LazyFrame, argument keep has a new default value "any" (#953).
      • In rolling aggregation functions (such as $rolling_mean()), the default value of argument closed is now NULL. Using closed with a fixed window_size now throws an error (#937).
      • In pl$date_range(), the argument end must be specified and the default value of interval is changed to "1d". The arguments start and end no longer accept numeric values (#950).
      • In pl$scan_parquet(), the default value of the argument rechunk is changed from TRUE to FALSE (#1033).
      • In pl$scan_parquet() and pl$read_parquet(), the argument parallel only accepts "auto", "columns", "row_groups", and "none". Previously, it also accepted upper-case notation of "auto", "columns", "none", and "RowGroups" instead of "row_groups" (#1033).
      • In $str$to_integer() (renamed from $str$parse_int()), the default value of base is changed from 2 to 10 (#1038).
  • The usage of pl$date_range() to create a range of Datetime data type is deprecated. pl$date_range() will always create a range of Date data type in the future. Use pl$datetime_range() if you want to create a range of Datetime instead (#950).

  • <DataFrame>$get_columns() now returns an unnamed list instead of a named list (#991).

  • Removed $argsort() which was an old alias for $arg_sort() (#930).

  • Removed pl$expr_to_r() which was an alias for $to_r() (#938).

  • <Series>$to_r_list() is renamed <Series>$to_list() (#938).

  • Removed <Series>$to_r_vector() which was an old alias for <Series>$to_vector() (#938).

  • Removed <Expr>$rep_extend(), which was an experimental method created at the early stage of this package and does not exist in other language APIs (#1028).

  • The following deprecated functions are now removed: pl$threadpool_size(), <DataFrame>$with_row_count(), <LazyFrame>$with_row_count() (#965).

  • In $group_by_dynamic(), the first datapoint is always preserved (#1034).

  • $str$parse_int() is renamed to $str$to_integer() (#1038).
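
A few of the 0.16.0 changes above ($drop() with several character scalars, named keep and maintain_order in $unique(), and $str$to_integer() with its new default base) can be sketched like this, assuming polars 0.16.0 or later and made-up data:

```r
library(polars)

# Hypothetical example data.
df = pl$DataFrame(
  id = c(1L, 1L, 2L),
  digits = c("101", "101", "11"),
  tmp = c("a", "b", "c")
)

# `$drop()` takes column names directly instead of a `columns` argument.
trimmed = df$drop("tmp")

# `keep` and `maintain_order` must be named in `$unique()`.
deduped = trimmed$unique(keep = "first", maintain_order = TRUE)

# `$str$to_integer()` replaces `$str$parse_int()`; `base` now defaults to 10.
parsed = deduped$with_columns(
  as_binary = pl$col("digits")$str$to_integer(base = 2),
  as_decimal = pl$col("digits")$str$to_integer()
)
```

Code that relied on the old default of base = 2 needs to pass base = 2 explicitly.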

New features

  • New functions:

    • pl$arg_sort_by() (#929).
    • pl$arg_where() to get the indices that match a condition (#922).
    • pl$datetime(), pl$date(), and pl$time() to easily create Expr of class datetime, date, and time via columns and literals (#918).
    • pl$datetime_range(), pl$date_ranges() and pl$datetime_ranges() (#950, #962).
    • pl$int_range() and pl$int_ranges() (#968).
    • pl$mean_horizontal() (#959).
    • pl$read_ipc() (#1033).
    • is_polars_dtype() (#927).
  • New methods:

    • <LazyFrame>$to_dot() to print the query plan of a LazyFrame with graphviz dot syntax (#928).
    • $clear() for DataFrame, LazyFrame, and Series (#1004).
    • $item() for DataFrame and Series (#992).
    • $select_seq() and $with_columns_seq() for DataFrame and LazyFrame (#1003).
    • $arr$to_list() (#1018).
    • $str$extract_groups() (#979).
    • $str$find() (#985).
    • <DataFrame>$write_ipc() (#1032).
    • RPolarsDataType gains several methods to check the datatype, such as $is_integer(), $is_null() or $is_list() (#1036).
  • New arguments or argument values:

    • ambiguous can now take the value "null" to convert ambiguous datetimes to null values (#937).
    • n in $str$replace() (#987).
    • non_existent in $dt$replace_time_zone() to specify what should happen when a datetime doesn't exist.
    • mapping_strategy in $over() (#984, #988).
    • raise_if_undetermined in $meta$output_name() (#961).
    • null_on_oob in $arr$get() and $list$get() to determine what happens when the index is out of bounds (#1034).
    • nulls_last, multithreaded, and maintain_order in $sort_by() (#1034).
  • Other:

    • pl$Series() now calls as_polars_series() internally, so it can convert more classes to Series properly (#1015).
    • Export the Duration datatype (#955).
    • New active binding <Series>$struct$fields (#1002).
    • All $write_*() and $sink_*() functions now invisibly return the input data (#1039).
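
Some of the new functions and methods listed above can be tried out as follows. This is a small sketch assuming polars 0.16.0 or later; the ranges and column names are arbitrary.

```r
library(polars)

# Integer range from 0 up to (but not including) 5.
ints = pl$select(i = pl$int_range(0, 5))

# `$item()` extracts the single value of a 1 x 1 DataFrame as an R value.
total = pl$select(pl$int_range(0, 5)$sum())$item()

# Datetime range with a 6-hour interval; both ends are included by default.
stamps = pl$select(
  ts = pl$datetime_range(
    as.Date("2024-01-01"),
    as.Date("2024-01-02"),
    interval = "6h"
  )
)
```

pl$datetime_range() is the function to use for Datetime ranges now that this use of pl$date_range() is deprecated.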

Bug fixes

  • The join_nulls and validate arguments of <DataFrame>$join() now work correctly (#945).
  • We said in the changelog of 0.14.0 that all row_count_* args in I/O functions were renamed row_index_*, but this change was not made for CSV and IPC functions. This renaming is now made (#964).
  • Evaluating Series methods from Expr inside functions now works correctly (#973). Thanks @Yunuuuu for the report.
  • The dependent crate extendr-api is updated to the unreleased 2024-03-31 version (#995). This resolves the issue where the R session crashed when a panic occurred on the Rust side. Thanks @CGMossa for the upstream fix.
  • The parallel argument of pl$scan_parquet() and pl$read_parquet() now works correctly (#1033). Previously, any correct value was treated as "auto".

Full Changelog: https://github.com/pola-rs/r-polars/compare/v0.15.1...v0.16.0