Dataprep Versions

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.

v0.4.4

1 year ago

Bugfixes 🐛

  • eda: type error for npartitions (57db1ede)
  • eda.create-db-report: remove pystache dependency and replace it with jinja2 (676fff1a)
  • eda.create-db-report: add missing style files previously ignored by gitignore (75361915)
  • eda: jinja2.markup import broken with 3.1 (b9b60a0a)
  • eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)
  • eda: report for empty df (485e58d3)
  • eda: plot_diff when columns are not aligned (7e53dbf6)
  • eda: scipy version issue (8798a146)
  • eda: na column name when upgrade dask (43fdd1a6)
  • eda: pd grouper issue when upgrade dask (761c4455)
  • clean: delete redundant print (0e072a80)
  • eda.plot: fix display issue in notebook (6ed13b09)
  • eda.plot: fix pagination styling issues (8396f2d9)
  • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
  • eda: interaction error in report for cat-only df (e60239a0)
  • eda: fix cat-cat error (94f70ef6)
  • eda: fix stat layout issue (5bb535d7)
  • eda.create_report: fix display issue in notebook (487659fd)
  • clean: remove usaddress library (c192ab43)
  • clean: fix the bug of am, pm (4c3b2312)
  • clean: fix the bug of am, pm (caf2b372)
  • eda: fixed issue where plots weren't rendering twice (fd3fd573)
  • eda: wordcloud setting in terminal (00901699)

Features ✨

  • clean: add updated version of rapidfuzz and python-crfsuite (59f35066)
  • eda.create-db-report: add save report functionality (2fb16ad6)
  • eda: add get_db_names (a7bf8206)
  • eda: added sorting feature for create_diff_report (8b187a6c)
  • eda: add running total for time series test (d0940726)
  • eda: add create_db_report submodule (9784cceb)
  • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
  • eda.create_report: add sort by approximate unique (5738db2a)
  • eda: add sort variables by alphabetical and missing (fb93493a)
  • clean: New version of GUI (6828807b)
  • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

  • eda: add test for npartition type error (5affd75a)
  • eda: add tests for intermediate compute functions (700add77)

Documentation 📃

  • eda: add the use-case of dataprep.eda for spark dataframe with ray (4bf14e7c)
  • clean: revise _init.py (02ede811)
  • clean: add doc of clean GUI (5e2f38ac)
  • eda.plot: add pagination for plot (c4cd4b97)
  • eda.create_report: remove old doc file (e1153cb1)
  • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
  • eda: add doc for getting imdt result (6fbcfe4c)
  • eda: add doc for running dataprep.eda on Hadoop YARN (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.4.4-alpha.1

1 year ago

Bugfixes 🐛

  • eda.create-db-report: add missing style files previously ignored by gitignore (75361915)
  • eda: jinja2.markup import broken with 3.1 (b9b60a0a)
  • eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)
  • eda: report for empty df (485e58d3)
  • eda: plot_diff when columns are not aligned (7e53dbf6)
  • eda: scipy version issue (8798a146)
  • eda: na column name when upgrade dask (43fdd1a6)
  • eda: pd grouper issue when upgrade dask (761c4455)
  • clean: delete redundant print (0e072a80)
  • eda.plot: fix display issue in notebook (6ed13b09)
  • eda.plot: fix pagination styling issues (8396f2d9)
  • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
  • eda: interaction error in report for cat-only df (e60239a0)
  • eda: fix cat-cat error (94f70ef6)
  • eda: fix stat layout issue (5bb535d7)
  • eda.create_report: fix display issue in notebook (487659fd)
  • clean: remove usaddress library (c192ab43)
  • clean: fix the bug of am, pm (4c3b2312)
  • clean: fix the bug of am, pm (caf2b372)
  • eda: fixed issue where plots weren't rendering twice (fd3fd573)
  • eda: wordcloud setting in terminal (00901699)

Features ✨

  • eda: added sorting feature for create_diff_report (8b187a6c)
  • eda: add running total for time series test (d0940726)
  • eda: add create_db_report submodule (9784cceb)
  • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
  • eda.create_report: add sort by approximate unique (5738db2a)
  • eda: add sort variables by alphabetical and missing (fb93493a)
  • clean: New version of GUI (6828807b)
  • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

  • eda: add tests for intermediate compute functions (700add77)

Documentation 📃

  • clean: revise _init.py (02ede811)
  • clean: add doc of clean GUI (5e2f38ac)
  • eda.plot: add pagination for plot (c4cd4b97)
  • eda.create_report: remove old doc file (e1153cb1)
  • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
  • eda: add doc for getting imdt result (6fbcfe4c)
  • eda: add doc for running dataprep.eda on Hadoop YARN (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.4.3

2 years ago

Bugfixes 🐛

  • eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)
  • eda: report for empty df (485e58d3)
  • eda: plot_diff when columns are not aligned (7e53dbf6)
  • eda: scipy version issue (8798a146)
  • eda: na column name when upgrade dask (43fdd1a6)
  • eda: pd grouper issue when upgrade dask (761c4455)
  • clean: delete redundant print (0e072a80)
  • eda.plot: fix display issue in notebook (6ed13b09)
  • eda.plot: fix pagination styling issues (8396f2d9)
  • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
  • eda: interaction error in report for cat-only df (e60239a0)
  • eda: fix cat-cat error (94f70ef6)
  • eda: fix stat layout issue (5bb535d7)
  • eda.create_report: fix display issue in notebook (487659fd)
  • clean: remove usaddress library (c192ab43)
  • clean: fix the bug of am, pm (4c3b2312)
  • clean: fix the bug of am, pm (caf2b372)
  • eda: fixed issue where plots weren't rendering twice (fd3fd573)
  • eda: wordcloud setting in terminal (00901699)

Features ✨

  • eda: added sorting feature for create_diff_report (8b187a6c)
  • eda: add running total for time series test (d0940726)
  • eda: add create_db_report submodule (9784cceb)
  • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
  • eda.create_report: add sort by approximate unique (5738db2a)
  • eda: add sort variables by alphabetical and missing (fb93493a)
  • clean: New version of GUI (6828807b)
  • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

  • eda: add tests for intermediate compute functions (700add77)

Documentation 📃

  • clean: revise _init.py (02ede811)
  • clean: add doc of clean GUI (5e2f38ac)
  • eda.plot: add pagination for plot (c4cd4b97)
  • eda.create_report: remove old doc file (e1153cb1)
  • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
  • eda: add doc for getting imdt result (6fbcfe4c)
  • eda: add doc for running dataprep.eda on Hadoop YARN (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.4.2

2 years ago

Bugfixes 🐛

  • eda: na column name when upgrade dask (43fdd1a6)
  • eda: pd grouper issue when upgrade dask (761c4455)
  • clean: delete redundant print (0e072a80)
  • eda.plot: fix display issue in notebook (6ed13b09)
  • eda.plot: fix pagination styling issues (8396f2d9)
  • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
  • eda: interaction error in report for cat-only df (e60239a0)
  • eda: fix cat-cat error (94f70ef6)
  • eda: fix stat layout issue (5bb535d7)
  • eda.create_report: fix display issue in notebook (487659fd)
  • clean: remove usaddress library (c192ab43)
  • clean: fix the bug of am, pm (4c3b2312)
  • clean: fix the bug of am, pm (caf2b372)
  • eda: fixed issue where plots weren't rendering twice (fd3fd573)
  • eda: wordcloud setting in terminal (00901699)

Features ✨

  • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
  • eda.create_report: add sort by approximate unique (5738db2a)
  • eda: add sort variables by alphabetical and missing (fb93493a)
  • clean: New version of GUI (6828807b)
  • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

  • eda: add tests for intermediate compute functions (700add77)

Documentation 📃

  • clean: add doc of clean GUI (5e2f38ac)
  • eda.plot: add pagination for plot (c4cd4b97)
  • eda.create_report: remove old doc file (e1153cb1)
  • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
  • eda: add doc for getting imdt result (6fbcfe4c)
  • eda: add doc for running dataprep.eda on Hadoop YARN (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.4.1

2 years ago

Bugfixes 🐛

  • eda: stat layout in plot (946319f7)
  • eda: fix display in plot(df) (c11bb94c)
  • eda: report for pandas extension type (2cbb3873)
  • eda: fix saving imdt as json file (5ee6529f)

Features ✨

  • clean: Add wiki and simple GUI (7f4ab12a)
  • eda: added overview and variables section for create_diff_report (dc4cf7da)
  • eda: add categorical interaction in create_report (7f13cd57)

Code Quality + Testing 💯

  • eda: added basic automated tests (3a0653e0)

Documentation 📃

  • eda: link create_diff_report to intro (05d9850b)
  • eda: added docs for create_diff_report (d8fc9d4b)
  • eda: enrich parameters in report (3d0a148a)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.4.0

2 years ago

Bugfixes 🐛

  • eda: fix string type (b7e3321f)
  • eda: fix value table display (57281bc2)
  • eda: remove imdt output from plot (5c227e15)
  • eda: adjusted save report method to accept one parameter (4ceefcc1)
  • eda: clean config code and fix scatter sample param (8ab27f92)
  • plot_diff: fix ci issue (44ce81cf)
  • clean: clean_duplication issue 646 (ca9f7085)
  • eda: fix category type error (9750694a)

Features ✨

  • eda: refactored code and added density parameter to plot_diff(df) (323ae6b0)
  • eda: save imdt as json file (78673867)
  • connector: integrate connectorx into connector (106457e3, a64e3563, 9f89d3bf)
  • clean: add clean_ml function (909cd196)
  • clean: add multiple clean functions for number types (3c05be58)
  • eda.diff: add plot_diff([df1..dfn], continuous) (3bfb4f57)
  • clean: support conversion into packed binary format in clean_ip (7e30f93f, 37a83b03)
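
For readers unfamiliar with the term, the "packed binary format" in the clean_ip entry above presumably means the raw network-byte-order bytes of an address (4 bytes for IPv4, 16 for IPv6). The sketch below illustrates that representation with Python's standard `ipaddress` module only. It is not dataprep's API, and the helper names `to_packed` and `from_packed` are hypothetical.

```python
import ipaddress


def to_packed(ip: str) -> bytes:
    """Return the packed (network byte order) bytes of an IPv4 or IPv6 address."""
    # Hypothetical helper for illustration; uses only the standard library.
    return ipaddress.ip_address(ip).packed


def from_packed(raw: bytes) -> str:
    """Rebuild the textual address from 4 (IPv4) or 16 (IPv6) packed bytes."""
    return str(ipaddress.ip_address(raw))


packed_v4 = to_packed("192.168.0.1")
packed_v6 = to_packed("::1")

print(packed_v4)               # b'\xc0\xa8\x00\x01'
print(len(packed_v6))          # 16
print(from_packed(packed_v4))  # 192.168.0.1
```

The packed form is compact and sorts in address order, which is why it is often preferred for storage or range comparisons over the dotted string.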

Code Quality + Testing 💯

  • eda: add densify test and doc for diff (f8d2054d)
  • eda: add test for config (ab3172f5)

Performance 🚀

  • clean: update documentation of clean_duplication (50f90fa9)

Documentation 📃

  • clean: change the introduction (862b4478)
  • eda: change eda colab position (ce25b17d, d00b0bd5)
  • clean: add documentation for multiple clean functions for number types (732480f1)
  • clean: add documentation for clean_ml function (0c139db6)
  • eda: scatter.sample_rate added to documentation (549b3193)
  • eda: fix plot show (0b40a40f)
  • readme: add benchmark link (e807f798)
  • readme: small text change on clean and connector (e193a6a7)
  • readme: fix titanic link (29cc06cc)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.3.0

2 years ago

Bugfixes 🐛

  • eda: fix long name in missing heatmap (f6cc399e)
  • connector: fix bug in url_path_params (c95a7ff1)
  • eda: fix NA and int viz issue in plot_diff (ef36d5ac)
  • eda: fix missing for SmallCard and DateTime type (201e487b)
  • eda: fix create_report for dask csv (93e85673)
  • clean: fix mixed-up date formats in one column (e2956956)
  • eda: fixed uncaught dtype and long var names (24f0295e)
  • eda: fix correlation of num columns with small distinct values (9959b78a)
  • eda: fix issue with dataframe of one column (910bb71a)
  • eda: add geopoint in type count (94cbca23)
  • eda: fixed uncaught dtype exceptions (d301eb75)
  • eda: fix str transform with small distinct as categorical (65e7f907)
  • eda: fix na values display issue (1ce5775e)
  • eda: keep na when preprocess df (17d82191)
  • clean: fix returned df_clean in clean_dupl (180e6ad2)
  • clean: escape apostrophes in code exported by clean_dupl (e6ea7e97)
  • eda: fixed endless loop and UI issues (69779cd6)
  • eda: fix insight error (9ad4e26b)
  • eda: suppress warnings for missing and report (df2a1e70)
  • eda: fix insights of plot_correlation (f0ca5f41)
  • eda: suppress warnings of progress bar and dask (ca8da4e1)
  • eda.create_report: fix constant column error (160844ad)
  • docs: fix docs of clean_df (38dd4b2a)
  • clean: remove unneeded replace in clean_dupl (51c02cdd)
  • eda: fixed bugs that come with randomly generated datasets (53ecf76c)
  • eda: fix bugs in log transformation (209d7d0c)
  • eda: fixed and optimized css layouts (58e1b18f)
  • clean: fix bug in validate_country (28068d46)
  • eda: fix column name and index related issues (40a89b91)
  • eda: variables can be none (325b0904)
  • connector: path to new config repo (59603e5b)
  • clean: lat_long regex not match a date format (49d3d227)
  • eda.distribution: highlight variable names (998b1762)
  • eda: fix the error of numerical cell in object column (91c4f9df)
  • eda.distribution: box plot with object dtype (a37e9f21)
  • clean: add comma after street suffix or name (e7655db9)
  • clean: cast values as str in validate funcs (8e1b459a)

Features ✨

  • clean: tuple of input formats for clean_country() (6bc65513)
  • clean: add clean_text function (55d3ae95)
  • eda: change color of geo map (1dbcddbf)
  • clean: add clean_currency function (deb55938)
  • clean: add clean_df() function (b750284f)
  • type: detect column as categorical for small unique values (4696e598)
  • eda: add geo_plot function (bbe64ec2)
  • eda: create_report UI improvement (c849b013)
  • eda: added new function plot_diff (79523c30)
  • connector: allow parameters appear in url path (5adaf301)
  • eda: value frequency table (bc37b794)
  • eda: create_report UI improvement (72a0ca95)
  • clean: add clean_duplication() function (98ff38d0)
  • clean: support letters in clean_phone (25d163b3)
  • eda: specify colors in plot(df), plot(df, x) (33fa36ea)
  • connector: add functionality that lists supported websites (88187e18)
  • clean: add clean_address function (e839ecd3)
  • clean: add clean_headers function (40742a19)
  • eda: parameter management and how-to guide (d2e8b10a)
  • clean: add clean_date function (6aa6410e)
  • create_report: add tabs for correlation and missing (6dc568b5)

Code Quality + Testing 💯

  • eda: add test for geo point (943033a6)
  • eda: add dataset test for report (0de5208b)
  • eda: add test of random df (68239f03)
  • clean: add tests for clean_duplication() (a4b9d32b)
  • eda: add random data generator (e83f95b3)
  • clean: add tests for clean_headers (0aca076e)
  • eda: add test case of object column with numerical cell (57839841)
  • clean: add tests for clean_date and validate_date (812dbb8d)

Performance 🚀

  • eda: optimize df preprocess and performance of create_report (e7eb182f)
  • clean: update documentation of clean_date (c540fcc7)
  • clean: improve performance of clean_duplication (8fda37e8)
  • eda: use approximate nunique (60300644)
  • clean: improve the performance of clean_email() (176382bc)
  • clean: improve performance of clean_date (854329ba)

Documentation 📃

  • readme: update video, paper and titanic report for eda (1126dea8)
  • eda: replace x, y, z with col1, col2, col3 (57f65b30)
  • clean: add documentation for clean_text (65436b06)
  • eda: add documentation for insights (1e4659be)
  • clean: add documentation for clean_df() (4ecf0d71)
  • eda: update user guide's datasets (2428f98e)
  • eda: add documentation for geo plot (3558257c)
  • clean: add user guide for clean_duplication (d834e857)
  • clean: fix clean documentation (e3bed2ba)
  • connector: revision (23085dd3)
  • clean: add documentation for clean_date function (d445f36a)
  • connector: add info docs (cb8cb5c5)
  • connector: add config file section (f55226ea)
  • connector: adding a process overview via DBLP section (5794d6c8)
  • connector: remove stale rst files (433fdfe4)
  • connector: convert pagination section from rst to ipynb (e4b9ba0c)
  • connector: convert authorization section from rst to ipynb (d25af473)
  • connector: change the pointer in index file from connector.rst to introduction.ipynb (218e41c6)
  • connector: rewrite introduction and form doc structure (6a876937)
  • connector: update API reference doc (9bed1694)
  • clean: improve DataPrep.Clean ReadMe (a0bc96b0)
  • eda: update legacy documentations for eda (8f948e05)
  • clean: add documentation for clean_address (4061fca3)
  • clean: add documentation for clean_headers (7a9d519c)
  • clean: add links from user guide to api ref (182b5254)
  • clean: Docstrings for phone and email (47f1e33d)
  • datasets: add introduction for datasets (83d42cee)
  • clean: add API reference (68182f6a)
  • clean: add documentation for clean_ip function (9da3ed1e)
  • connector: add query() section (c904d1fc)
  • connector: add connect() section (bff842ed)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.2.15

3 years ago

Bugfixes 🐛

  • eda: add test to plot_missing (303a13e6)
  • eda: when data size is small using plot_missing (9e59aa00)
  • eda: set encoding to udf when file is opened (f43c1aa2)
  • clean: split parameter for clean_phone (f9bb1003)
  • connector: config manager checks _meta.json (5c2278de)
  • eda.create_report: univar datetime analysis (4632852a)
  • eda.report: encoding and show issue (721ae7be)

Features ✨

  • datasets: add load_dataset and get_dataset_names (2b9e1f95)
  • connector: allow using config from other branches (276afff3)
  • connector: from_key parameter validation (bd89ef29)
  • clean: add clean_ip function (3b232708)
  • connector: improve info (2a175a82)
  • eda: enrich plot_correlation (29c444e2)
  • clean: implement clean_phone for Canadian/US formats (45d43682)
  • eda: modify doc of plot_missing (489c9220)
  • clean: add errors parameter, enhance report for clean_url (aa7ec9cb)
  • clean: add clean_url function (2894d0a0)
  • eda: add stat. in plot_missing (0f44f153)
  • connector: adding validation for auth params (0a7c712d)
  • eda: convert all plot functions to new UI (36f8fa3e)
  • connector: update info function documentation (7b6ae530)
  • connector: create display dataframe function (9767cf47)

Code Quality + Testing 💯

  • clean: add tests for clean_ip and validate_ip (fc156829)
  • clean: add tests for clean_url (452dbe8f)
  • clean: add tests for clean_phone (fcf73106)
  • clean: add tests for clean_email() (fdd02c62)
  • clean: add tests for clean_country() (8a593fa6)
  • clean: add tests for clean_lat_long (aea26025)

Performance 🚀

  • clean: improve the performance of the clean subpackage (c7c787bd)

Documentation 📃

  • README: add link to each section (b687076a)
  • README: polish EDA section (fd5ef8c4)
  • clean: add documentation for clean_url (bf937f9d)
  • clean: add documentation for clean_phone (8165a428)
  • readme: fix the broken image (12e1fa16)
  • readme: add introduction for dataprep.clean (3710037d)
  • clean: add docs for clean_country (21639814)
  • eda: modify doc for plot_correlation (b6b377c9)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.2.14

3 years ago

Bugfixes 🐛

  • eda.plot_missing: new label texts and color mapping (71a95f91)
  • connector: add missing authdef (8b274b92)
  • eda.create_report: handle unhashable dtypes (77437491)

Features ✨

  • connector: remove jsonschema dependency (6f07faf9)
  • connector: don't support xml website anymore (fa173a06)
  • connector: simplify generator, add connect (a96d9b3c)
  • clean: implement clean_country function (5dea1bde)
  • connector: do not update local config if it already exists (cd675f30)
  • eda: Redesigned layout for plot_missing (c85eaa5d)
  • connector: add generator UI (4d1e9004)

Performance 🚀

  • eda: optimize plot_missing and plot_corr (b46036dc)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉

v0.2.13

3 years ago

Bugfixes 🐛

  • eda: change dtype 'string' to 'object' (8ddddbcf)
  • eda: remove unnecessary compute (98c4ab0c)
  • connector: wrong calculation for pagination (516038b9)
  • eda.data_array: handle empty df correctly (97db86d7)
  • eda.distribution: fix pie chart insight (d3564a6f)
  • eda.distribution: delay scipy computations (89fafaec)
  • eda.correlation: wrong mask calculation (8ebe9cc0)
  • eda.plot: fixed wordcloud, all nan column (ce762d55)

Features ✨

  • connector: implement authorization code (e6838ca1)
  • connector: full text search _q to be a universal parameter (947584ab)
  • cleaning: add clean_email() function (4658a208)
  • connector: implement generator (7a93ea0e)
  • connector: add token based pagination (5ec6e00c)
  • connector: implement page pagination (02c93b4e)
  • connector: implement header authentication (d879c207)
  • connector: use pydantic for schema (dff08442)
  • connector: rename pagination types (500ce130)
  • cleaning: add report parameter for clean_lat_long (f0af6212)
  • connector: Parameter check when calling query() (0db7a16b)
  • eda: support series as the input (bad6a873)
  • eda.plot: Redesigned layout for plot(df, x) (04c7fd55)
  • cleaning: clean latitude, longitude coordinates (93927a98)
  • eda.report: allow disabling the progress bar (2a90f7f3)
  • eda.correlation: move nan corr values to the bottom (4bba52e0)
  • eda: add progress bar for dask local scheduler (e13257c8)
  • eda.plot: increase # of bins and ngroups (f78cfaef)

Performance 🚀

  • eda.plot: changed drop_null to dropna (0a7fe56d)
  • eda.missing: use DataArray (fb69ea1b)
  • eda.plot: optimize bivariate computations (031748e9)
  • eda: improve progress bar performance (64be8895)
  • eda.correlation: increase the performance (3575aac4)
  • eda.correlation: performance tuning (68471e50)

Documentation 📃

  • cleaning: add documentation for clean_email() (5bc37706)
  • cleaning: update clean_lat_long docs (d698a10e)
  • cleaning: add documentation for clean_lat_long (eaba8c71)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

🎉🎉 Thank you! 🎉🎉