Modin: Scale your Pandas workflows by changing a single line of code
This release introduces the modin.utils.execute function to improve the benchmarking experience and includes a new version of HDK, 0.9. It also includes performance optimizations for sort_values, value_counts, 2D setitem, and several other operations, as well as many bug fixes.
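Because Modin may defer or parallelize work, timing an operation without waiting for it to finish can under-report its cost. The sketch below shows one way modin.utils.execute might be used when benchmarking; it is an illustration, not the project's own benchmark code. The helper name timed_sort and the pandas fallback are assumptions added so the example runs even where Modin is not installed, and execute is called with a single positional argument only.

```python
import time

import pandas

try:
    import modin.pandas as pd
    from modin.utils import execute  # introduced in this release (#6648)
except ImportError:
    # Assumption for illustration: fall back to plain pandas when Modin is absent.
    pd = pandas

    def execute(*objs):
        """No-op stand-in: pandas is eager, so there is nothing to wait for."""
        return None


def timed_sort(n_rows=10_000):
    """Hypothetical helper: sort a frame and time it, including pending work."""
    df = pd.DataFrame({"a": range(n_rows, 0, -1), "b": range(n_rows)})
    start = time.perf_counter()
    result = df.sort_values("a")
    # Trigger any lazy computation and block until it completes,
    # so the measured interval covers the whole operation.
    execute(result)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    _, seconds = timed_sort()
    print(f"sort_values took {seconds:.4f} s")
```

Placing the execute call inside the timed region is the point of the pattern: without it, the timer may stop before the engine has finished the sort.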
ray.get() inside of the kernel executing call queues (#6633)
FutureWarnings in rolling unless necessary (#6586)
Series.groupby.agg (#6613)
join to avoid distributing a dict object warning (#6612)
DataFrame.agg() (#6606)
.sort_values() (#6608)
FutureWarnings for first/last/bool (#6625)
groupby.diff() for dates (#6631)
groupby.apply in case of experimental groupby (#6649)
read_csv(): treat object dtype as string (#6636)
skiprows parameter usage for read_excel (#6638)
modin.numpy.array.sum on HDK (#6643)
modin/experimental/sql/hdk/query.py part of modin package (#6646)
Series.between works correctly (#6656)
navigation_with_keys=True to fix docs build (#6681)
from_pandas() for numerical data in Ray (#6640)
sort_values by reducing the number of partitions (#6589)
MODIN_CPUS instead of os.cpu_count() for the fragment size calculation (#6615)
LazyProxyCategoricalDtype materialization on merge (#6630)
dot operation (#6644)
value_counts(): Eliminate redundant sorting. (#6654)
random_integers func (#6623)
pytest to print warnings in tests output (#6621)
execute to trigger lazy computations and wait for them to complete (#6648)
materialize parameter for partition.ip func (#6650)
pyhdk version to 0.9 (#6676)
@AndreyPavlenko @Egor-Krivov @Garra1980 @YarShev @anmyachev @dchigarev
This release upgrades the pandas version to 2.1, raises the minimum supported Python version to 3.9, introduces ModinDataLoader to improve interaction with PyTorch, fixes several issues with the interchange protocol (resolving known compatibility issues with Plotly, Seaborn, and Altair), and includes a new version of HDK, 0.8. It also includes other new features and many bug fixes.
ray>2.6.0 (#6425)
read_csv (#5507)
read_excel: defaults to pandas for unsupported types of 'io' (#6462)
query and eval (#6488)
Column.null_count to return a built-in int instead of NumPy scalar (#6526)
unwrap_partitions for virtual partitions when axis=None (#6560)
__getattribute__ for experimental mode (#6529)
temp_df.dtype == 'category' (#6360)
Series.str.find/index/rfind/rindex (#6426)
copy on empty DataFrame/Series objects (#6371)
__array__ method always returns array of vanilla numpy (#6300)
BenchmarkMode.put(True) (#6365)
groupby.size() in reshuffling groupby (#6370)
sum operation (#6421)
astype calls for modin.array.sum op (#6395)
DataFrame.mean() result (#6520)
__setitem__ op when using not hashable key (#6547)
__factory to None in case of any problems during initialization (#6397)
diff (#6403)
disable_logging to __getattr__ (#6406)
read_feather with pyarrow<11.0 (#6415)
flake8==6.1.0 (#6428)
pymssql==2.2.8 from environments (#6430)
~ in paths in IO functions correctly (#6448)
sum|mean|median groupby aggregations (#6444)
fastparquet>=2023.1.0 (#6458)
groupby.apply() for UDFs that change the output's shape (#6506)
is_bool_dtype() for categorical (#6480)
__array_ufunc__ (#6486)
test_sort_cols_str from test_dataframe.py crashed on HDK 0.7.0 and python 3.9 (#6515)
botocore as an optional dependency (#6521)
read_excel so that it doesn't use rich_text param for old openpyxl (#6534)
s3fs<2023.9.0 (#6536)
s3fs<2023.9.0 (#6544)
read_parquet (#6545)
ValueError: buffer source array is read-only for iloc (#6538)
dfsql module (#6550)
FutureWarnings in groupby unless necessary (#6595)
read_csv with iterator=True (#6554)
.read_parquet() (#6559)
MODIN_OMNISCI_* env vars in favor of MODIN_HDK_* (#6562)
map function via applymap (#6566)
FutureWarnings in bfill/backfill/ffill/pad unless necessary (#6599)
sort_values shouldn't affect source dataframe/series (#6603)
concat operation (#6381)
_repartition (#6376)
numpy.array operations in internals of iloc/loc operation (#6393)
__getitem__ when the number of rows to be taken > 90% (#6423)
.dropna() using map-reduce pattern (#6472)
reindex (#6438)
qc.to_datetime() (#6525)
query() (#6584)
.from_pandas() (#6591)
BasePandasDataset.apply (#6451)
isort (#6551)
Patcher internal class (#6471)
__invert__ (#6490)
contextlib.nullcontext instead of custom one (#6570)
is_int64_dtype and is_period_dtype function (#6577)
time_groupby_agg_nunique ASV bench (#6564)
psycopg2-binary for testing and developing purpose (#6573)
df.eval with scalar and groupby.transform call in the expr (#6546)
repr to force materialization (#6461)
numexpr<2.8.5 (#6474)
boto3 from environments to speedup creation (#6496)
read_parquet supported parameters (#6420)
dataframe.insert function (#6400)
DataLoader interplay. (#6140)
.modin folder (#6390)
to_parquet (#6404)
read_parquet (#6442)
nlargest/nsmallest groupby aggregation (#6485)
datetime64 to int64 cast (#6501)
enable_multifrag_execution_result=1 HDK launch parameter (#6503)
@AndreyPavlenko @RehanSD @YarShev @anmyachev @dchigarev @mvashishtha @vnlitvinov @abykovsk @zmbc @noloerino @rentruewang
This release contains fixes that improve Modin's performance for both the NumPy and pandas APIs, and it removes the Modin in the Cloud experimental feature. It also includes upgrades to Modin's testing suite that significantly speed up CI.
Series.str.find/index/rfind/rindex (#6426)
diff (#6403)
disable_logging to __getattr__ (#6406)
@AndreyPavlenko @RehanSD @YarShev @anmyachev @dchigarev @mvashishtha @vnlitvinov
Modin 0.23.0
This release upgrades the pandas version to 2.0. It also includes a '.corr' speed-up, new features, and bug fixes.
con parameter for to_sql (#5940)
read_json in case of rows having different columns (#5946)
read_excel and unpin openpyxl (#6247)
Series.equals/DataFrame.equals with NA entries (#6270)
wait method for Dask/Ray/Unidist wrappers (#6049)
groupby.rolling API (#6292)
@AndreyPavlenko @YarShev @alexbaden @anmyachev @dchigarev @kurapov-peter @mvashishtha @vnlitvinov
Patch release whose main purpose is pinning pydantic<2 to resolve Ray issues, plus a few bug fixes.
@AndreyPavlenko @anmyachev
This release includes support for pandas 2.0, a '.corr' speed-up, new features, and bug fixes.
Note: this is a release candidate. If everything goes well, we'll release Modin 0.23.0 in two weeks.
read_json in case of rows having different columns (#5946)
read_excel and unpin openpyxl (#6247)
wait method for Dask/Ray/Unidist wrappers (#6049)
@AndreyPavlenko @YarShev @anmyachev @dchigarev @mvashishtha @vnlitvinov
This release includes several bug fixes.
to_dict (https://github.com/modin-project/modin/pull/6260)
astype("category") causing read-only buffer error (https://github.com/modin-project/modin/pull/6267)
@mvashishtha
This release includes a bug fix.
@mvashishtha
This release includes support for pyhdk=0.6, a few performance enhancements, new features, and bug fixes.
@mvashishtha @AndreyPavlenko @anmyachev @dchigarev @jkew @YarShev