Modin: Scale your Pandas workflows by changing a single line of code
This release reverts the pandas requirement from 2.2.1 to >=2.2,<2.3
@sfc-gh-mvashishtha
This release introduces the modin.pandas.api.extensions module, faster implementations for the merge and groupby.rolling (by default) functions, and new functions to work with Ray Dataset: to/from_ray_dataset.
It also includes some other new features, performance optimizations, and bug fixes.
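The idea behind a range-partitioned merge can be illustrated without Modin. The sketch below is plain Python and is not Modin's implementation: the helper names (`pick_pivots`, `range_partition`, `range_partition_merge`) and the pivot-selection rule are assumptions made for illustration. Rows from both sides are bucketed by key range so that equal keys always land in the same bucket, and each pair of buckets is then joined independently.

```python
from bisect import bisect_right
from collections import defaultdict


def pick_pivots(keys, num_partitions):
    """Choose up to num_partitions - 1 split points from the sorted unique keys."""
    ordered = sorted(set(keys))
    if not ordered or num_partitions < 2:
        return []
    step = max(1, len(ordered) // num_partitions)
    return ordered[step::step][: num_partitions - 1]


def range_partition(rows, key, pivots):
    """Place each row in the bucket whose key range contains row[key]."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bisect_right(pivots, row[key])].append(row)
    return buckets


def range_partition_merge(left, right, key, num_partitions=4):
    """Inner-join two lists of dicts on `key`, one bucket pair at a time."""
    pivots = pick_pivots([row[key] for row in left], num_partitions)
    left_buckets = range_partition(left, key, pivots)
    right_buckets = range_partition(right, key, pivots)
    merged = []
    for bucket_id, left_rows in left_buckets.items():
        # Equal keys always share a bucket, so each bucket joins independently.
        lookup = defaultdict(list)
        for row in right_buckets.get(bucket_id, []):
            lookup[row[key]].append(row)
        for left_row in left_rows:
            for right_row in lookup.get(left_row[key], []):
                merged.append({**left_row, **right_row})
    return merged


left = [{"id": i, "x": i * 10} for i in range(8)]
right = [{"id": i, "y": i * 100} for i in (1, 3, 5, 7, 9)]
result = sorted(range_partition_merge(left, right, "id"), key=lambda r: r["id"])
print(result)
```

In a distributed setting each bucket pair could be joined by a separate worker. An empty right operand simply produces empty buckets and an empty result.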
merge when right operand is an empty dataframe (#6941)
read_parquet when dataset is created with to_parquet and index=False (#6937)
isort formatting for scripts from tutorials (#6945)
needs: [lint-black-isort, ...] (#6947)
groupby when Modin dataframe has several column partitions (#6951)
render_as_string to get sqlalchemy engine url (#6953)
test_all_urls_exist (#6975)
._propagate_index_objs() (#6977)
._copartition() for identical indices on binary operations (#6980)
read_pickle_distributed/to_pickle_distributed to read_pickle_glob/to_pickle_glob (#6957)
modin.pandas.DataFrame._to_pandas a public method (#6940)
DataFrame.to_pickle_distributed in favour of DataFrame.modin.to_pickle_distributed (#6959)
eval_general utility (#7003)
check_exception_type argument of eval_general function (#7009)
to_pandas and to_ray_dataset into modin namespace (#7014)
to_hdf and hist signatures to pandas (#7018)
pandas._testing.makeStringIndex (#6933)
test_series.py (#6995)
test_io.py (#6997)
log_level in logging module (#6992)
read_sql by getting connection url (#6956)
include_groups=False parameter in groupby.apply() (#6938)
groupby().rolling() by default (#6943)
.merge() using range-partitioning implementation (#6966)
to/from_ray_dataset functions (#6971)
MetaList.__getitem__() (#7006)
@AndreyPavlenko @Retribution98 @YarShev @anmyachev @arunjose696 @dchigarev @sfc-gh-dpetersohn @tochigiv
This release updates pandas to 2.2, introduces a lazy execution mode on Ray and new functions that support glob syntax, and speeds up several more groupby cases. It also includes some other new features, performance optimizations, and many bug fixes.
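The glob-style readers and writers operate on a set of files matched by a pattern rather than on a single path. The following standard-library sketch illustrates that pattern only; it does not call Modin, and the helper names (`write_json_parts`, `read_json_parts`) and the rule that `*` is replaced by a chunk index are assumptions made for this example.

```python
import glob
import json
import os
import tempfile


def write_json_parts(rows, pattern, num_parts):
    """Split rows into num_parts chunks; '*' in pattern becomes the chunk index."""
    chunk = -(-len(rows) // num_parts)  # ceiling division
    for index in range(num_parts):
        path = pattern.replace("*", str(index))
        with open(path, "w", encoding="utf-8") as handle:
            for row in rows[index * chunk:(index + 1) * chunk]:
                handle.write(json.dumps(row) + "\n")


def read_json_parts(pattern):
    """Read every file matching the glob pattern, in sorted path order."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            rows.extend(json.loads(line) for line in handle if line.strip())
    return rows


rows = [{"id": i, "value": i * i} for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    pattern = os.path.join(tmp, "part-*.json")
    write_json_parts(rows, pattern, num_parts=3)
    files = sorted(os.path.basename(p) for p in glob.glob(pattern))
    restored = read_json_parts(pattern)
print(files, len(restored))
```

Writing one file per chunk lets each part be written and read back independently, which is what makes a glob-based layout a natural fit for partitioned data.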
tolist function in DtypesDescriptor._merge_dtypes (#6844)
read_parquet works with integer columns for pyarrow engine (#6874)
query_compiler.merge (#6880)
astype works correctly with int32 and float32 dtypes (#6884)
.from_pandas() (#6912)
pydantic dependency (#6917)
JoinNode instead of MaskNode for non-range row_position (#6926)
iloc where beneficial (#6878)
DaskThreadsPerWorker to 1 (#6923)
missmatch to mismatch in ErrorMessage.missmatch_with_pandas method (#6901)
PyarrowOnRay execution in favour of pyarrow-backed pandas dataframes (#6848)
SocksProxy, DoLogRpyc, DoTraceRpyc outdated classes (#6834)
OrderedDict in favor of builtin dict (#6853)
_get_dimensions and change arguments (#6859)
__all__ in modin.config.__init__.py (#6886)
create_test_series (#6910)
tmp_path fixture (#6709)
to_csv tests on Unidist more stable (for test-all-unidist CI job) (#6851)
to_csv tests (#6847)
gs remote protocol since we rely on fsspec (#6882)
black>=24.1.0 (#6887)
pytest 8.0.0 (#6894)
read_json_glob and to_json_glob (#6873)
read_parquet_glob and to_parquet_glob (#6854)
read_xml_glob, to_xml_glob (#6930)
.length() and .width() being called in a loop (#6842)
to_pandas call in merge and join functions (#6850)
2.2.* (#6907)
@AndreyPavlenko @YarShev @anmyachev @arunjose696 @dchigarev @leshikus @vedant
This release includes a fix for the concat function.
tolist function in DtypesDescriptor._merge_dtypes (#6844)
to_csv tests on Unidist more stable (for test-all-unidist CI job) (#6851)
to_csv tests (#6847)
@leshikus @anmyachev
This release introduces a new, faster implementation for groupby.apply, as well as many performance fixes related to improving asynchronous execution, a new namespace for accessing experimental functions (for example, DataFrame.modin.to_pickle_distributed), a fix for a long-standing problem with the use of Modin objects inside UDFs for apply, and many other fixes.
Note: for Modin on MPI through unidist (as of unidist 0.5.0) to work fully when installed with pip, a working MPI implementation must be installed beforehand.
apply (#6673)
@lazy_metadata_decorator for PandasDataFrame.finalize (#6720)
astype op (#6692)
set_index_name(None) (#6698)
unidist <= 0.4.1 (#6746)
.insert() (#6757)
to_numpy use **kwargs after #6704 (#6769)
ValueError: assignment destination is read-only for cumsum (#6772)
_to_pandas return mutable pandas objects (#6775)
loc to get similar behavior to pandas (#6798)
Series.__getitem__ (#6780)
pandas.api.types.pandas_dtype to convert to valid numpy and pandas only dtypes (#6788)
DataFrame.join (#6787)
ModinIndex objects (#6800)
NotImplementedError to a user on a set_columns() with dupl labels (#6823)
ModinIndex._lengths_id on empty partitions filtering (#6825)
copy=True parameter for concat calls inside to_pandas (#4778)
broadcast_apply_full_axis (#6760)
reset_index for left merge (#6665)
copy=False for internal usage of set_axis (#6667)
copy() call for Series.reset_index (#6670)
Series.tolist function (#6672)
sync_labels=False for rank function (#6689)
lazy_map_partitions() for dtypes conversion (#6695)
get_axis internal function instead of axes property (#6700)
to_numpy (#6699)
_groupby_shuffle internal function (#6707)
_shape_hint in query_complier.copy function (#6713)
qc._shape_hint = column in columnarize function (#6715)
_filter_empties (#6717)
_get_axis_lengths function instead of _axes_lengths property (#6719)
keep_partitioning=True, for duplicated implementation (#6722)
_shape_hint = "column" in DataFrame.squeeze (#6724)
result.name = None in groupby code (#6726)
reset_index() (#6751)
.__setitem__() (#6758)
.concat(axis=0) (#6759)
execution_wrapper instead of directly addressing DaskWrapper (#6740)
modin.pandas.io module (#6806)
modin.experimental folder (#6813)
--extra-test-parameters option (#6730)
to_csv tests on Unidist more stable (#6776)
int type (#6796)
DataFrame.__rdivmod__/__divmod__ (#6785)
modin.pandas.error module (#6802)
groupby.apply() by default (#6804)
@AndreyPavlenko @JignyasAnand @RehanSD @YarShev @anmyachev @devin-petersohn @dchigarev @mvashishtha @seydar
Hotfix for Unidist.
Note: broken pip wheel, use https://github.com/modin-project/modin/releases/tag/0.24.1.post1 instead
@anmyachev @dchigarev
The main purpose of this release is to port as many fixes as possible to the latest version that supports Python 3.8.
unidist<=0.4.1
read_excel: defaults to pandas for unsupported types of io (#6462)
ray.get() inside of the kernel executing call queues (#6633)
Column.null_count to return a built-in int instead of NumPy scalar (#6526)
unwrap_partitions for virtual partitions when axis=None (#6560)
__getattribute__ for experimental mode (#6529)
groupby.apply() for UDFs that change the output's shape (#6506)
is_bool_dtype() for categorical (#6480)
reshuffling in case of a string key (#6510)
test_sort_cols_str from test_dataframe.py crashed on HDK 0.7.0 and python 3.9 (#6515)
test_dataframe.py is crashed if Calcite is disabled (#6517)
botocore as an optional dependency (#6521)
read_excel so that it doesn't use rich_text param for old openpyxl (#6534)
s3fs<2023.9.0 (#6536)
s3fs<2023.9.0 (#6544)
ValueError: buffer source array is read-only for iloc (#6538)
read_csv with iterator=True (#6554)
apply (#6673)
Series.groupby.agg (#6613)
sort_values shouldn't affect source dataframe/series (#6603)
join to avoid distributing a dict object warning (#6612)
.sort_values() (#6608)
groupby.apply in case of experimental groupby (#6649)
read_csv: treat object dtype as string (#6636)
skiprows parameter usage for read_excel (#6638)
modin.numpy.array.sum on HDK (#6643)
modin/experimental/sql/hdk/query.py part of modin package (#6646)
Series.between works correctly (#6656)
navigation_with_keys=True to fix docs build (#6681)
@AndreyPavlenko @Egor-Krivov @Garra1980 @RehanSD @anmyachev @dchigarev @vnlitvinov
This release introduces the modin.utils.execute function to improve the benchmarking experience and includes a new version of HDK, 0.9.
It also includes performance optimizations for sort_values, value_counts, 2D setitem, and several others, as well as many bug fixes.
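Why a function that forces execution helps benchmarking can be shown with a small standard-library sketch. It does not use Modin: `AsyncResult`, `wait_for_all`, and `slow_sum` are hypothetical stand-ins, with `wait_for_all` playing a role analogous to the one these notes describe for modin.utils.execute (trigger pending computations and wait for them to complete). When work runs asynchronously, timing only the call that schedules it measures almost nothing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class AsyncResult:
    """Hypothetical stand-in for an object whose work runs in the background."""

    def __init__(self, future):
        self._future = future

    def wait(self):
        """Block until the background work is done and return its value."""
        return self._future.result()


def wait_for_all(*objs):
    """Block until every pending computation has finished (sketch only)."""
    for obj in objs:
        obj.wait()


def slow_sum(values):
    time.sleep(0.2)  # stand-in for expensive distributed work
    return sum(values)


with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    pending = AsyncResult(pool.submit(slow_sum, range(1000)))
    submit_time = time.perf_counter() - start  # measures scheduling only

    wait_for_all(pending)
    total_time = time.perf_counter() - start  # includes the actual work

print(f"submit: {submit_time:.3f}s, total: {total_time:.3f}s")
```

The first measurement reflects only how long it took to hand the task to the pool; the second, taken after waiting, is the figure a benchmark should report.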
ray.get() inside of the kernel executing call queues (#6633)
FutureWarnings in rolling unless necessary (#6586)
Series.groupby.agg (#6613)
join to avoid distributing a dict object warning (#6612)
DataFrame.agg() (#6606)
.sort_values() (#6608)
FutureWarnings for first/last/bool (#6625)
groupby.diff() for dates (#6631)
groupby.apply in case of experimental groupby (#6649)
read_csv(): treat object dtype as string (#6636)
skiprows parameter usage for read_excel (#6638)
modin.numpy.array.sum on HDK (#6643)
modin/experimental/sql/hdk/query.py part of modin package (#6646)
Series.between works correctly (#6656)
navigation_with_keys=True to fix docs build (#6681)
from_pandas() for numerical data in Ray (#6640)
sort_values by reducing the number of partitions (#6589)
MODIN_CPUS instead of os.cpu_count() for the fragment size calculation (#6615)
LazyProxyCategoricalDtype materialization on merge (#6630)
dot operation (#6644)
value_counts(): Eliminate redundant sorting. (#6654)
random_integers func (#6623)
pytest to print warnings in tests output (#6621)
execute to trigger lazy computations and wait for them to complete (#6648)
materialize parameter for partition.ip func (#6650)
pyhdk version to 0.9 (#6676)
@AndreyPavlenko @Egor-Krivov @Garra1980 @YarShev @anmyachev @dchigarev