Recordlinkage Versions

A powerful and modular toolkit for record linkage and duplicate detection in Python

v0.16

9 months ago

A new release of recordlinkage after a long time (too long, I'm sorry). This release bumps the minor version to 0.16 and supports both pandas 1 and pandas 2. It contains no structural changes or improvements to the API.

Full Changelog: https://github.com/J535D165/recordlinkage/compare/v0.15...v0.16

v0.15

2 years ago
  • Remove deprecated recordlinkage classes (#173)
  • Bump min Python version to 3.6, ideally 3.8+ (#171)
  • Bump min pandas version to >=1
  • Resolve deprecation warnings for numpy and pandas
  • Happy lint, sort imports, format code with yapf
  • Remove unnecessary np.sort in SNI algorithm (#141)
  • Fix bug for cosine and qgram string comparisons with threshold (#135)
  • Fix several typos in docs (#151)(#152)(#153)(#154)(#163)(#164)
  • Fix random indexer (#158)
  • Fix various deprecation warnings and broken docs build (#170)
  • Fix broken docs build due to pandas depr warnings (#169)
  • Fix broken build and removed warning messages (#168)
  • Update narrative
  • Replace Travis by GitHub Actions (#132)
  • Fix broken test NotFittedError
  • Fix bug in low memory random sampling and add more tests (#130)
  • Add extras_require to setup.py for deps management
  • Add banner to README and update title
  • Add Binder and Colab buttons to tutorials (#174)

Special thanks to Tomasz Waleń @twalen and other contributors for their work on this release.

v0.14

4 years ago
  • Drop Python 2.7 and Python 3.4 support. (#91)
  • Upgrade minimal pandas version to 0.23.
  • Simplify the use of all cpus in parallel mode. (#102)
  • Store large example datasets in user home folder or use environment variable. Before, example datasets were stored in the package. (see issue #42) (#92)
  • Add support to write and read annotation files for recordlinkage ANNOTATOR. See the docs and https://github.com/J535D165/recordlinkage-annotator for more information.
  • Replace .labels by .codes for pandas.MultiIndex objects for newer versions of pandas (>0.24). (#103)
  • Fix totals for pandas.MultiIndex input on confusion matrix and accuracy metrics. (see issue #84) (#109)
  • Fix bug when initializing Compare with (a list of) features. (#124)
  • Various updates in relation to deprecation warnings in third-party libraries such as sklearn, pandas and networkx.
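The dataset storage change in this release (user home folder, or a location set through an environment variable) follows a common lookup pattern. A minimal sketch of that pattern is shown here; the variable name `EXAMPLE_DATA_HOME`, the folder `~/example_data` and the function `get_data_home` are hypothetical names chosen for illustration and are not taken from recordlinkage itself.

```python
import os
from pathlib import Path

# Hypothetical names for this sketch only; check the recordlinkage docs
# for the names the library actually uses.
ENV_VAR = "EXAMPLE_DATA_HOME"
DEFAULT_DIR = "~/example_data"


def get_data_home(env=None):
    """Return the dataset folder: the environment variable if set,
    otherwise a folder inside the user's home directory."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR) or DEFAULT_DIR
    return Path(raw).expanduser()


# No variable set: fall back to the home folder.
print(get_data_home({}))
# Variable set: use the configured location instead.
print(get_data_home({ENV_VAR: "/tmp/my_datasets"}))
```

Keeping large files out of the installed package this way means the package stays small and the data can live on a disk of the user's choosing.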

v0.13.2

5 years ago

Fix distribution problem.

v0.13

5 years ago

v0.11.2

6 years ago
  • Minor installation improvement: exclude unwanted files.

v0.11.1

6 years ago
  • Fix installation issue. Submodule 'preprocessing' was not added to the source distribution.

v0.11.0

6 years ago
  • The submodule 'standardise' is renamed to 'preprocessing'. The submodule 'standardise' will be deprecated in a future version.
  • Deprecation errors were not visible for many users. In this version, the errors are more visible.
  • Improved and new logs for indexing, comparing and classification.
  • Faster comparing of string variables. Thanks Joel Becker.
  • Changes make it possible to pickle Compare and Index objects. This makes it easier to run code in parallel. Tests were added to ensure that pickling remains possible.
  • Important change: previously, MultiIndex objects with many record pairs were automatically split into pieces to lower memory usage. This automatic splitting has been removed. Please split the data yourself.
  • Integer indexing. Blog post will follow on this.
  • The metrics submodule has changed heavily. This breaks compatibility with the previous version.
  • repr() and str() now return informative descriptions for index and compare objects.
  • It is possible to use abbreviations for string similarity methods. For example 'jw' for the Jaro-Winkler method.
  • The FEBRL dataset loaders can now return the true links as a pandas.MultiIndex for each FEBRL dataset. This option is disabled by default. See the FEBRL datasets for details.
  • Fix issue with automatic recognition of license on GitHub.
  • Various small improvements.

Note: In the next release, the Pairs class will be removed. Migrate now.
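Since automatic splitting of large sets of record pairs was removed in v0.11.0, splitting is left to the caller. One possible way to do it is sketched here with a plain Python list standing in for the candidate pairs. The helper `split_pairs` is a hypothetical name, not part of recordlinkage. It relies only on `len()` and slicing, so it is expected to work on a pandas.MultiIndex as well, though that is an assumption to verify against your pandas version.

```python
def split_pairs(pairs, chunk_size):
    """Yield consecutive slices of `pairs`, each with at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(pairs), chunk_size):
        yield pairs[start:start + chunk_size]


# Stand-in for candidate record pairs: (index in dataset A, index in dataset B).
candidate_pairs = [(a, b) for a in range(3) for b in range(3)]  # 9 pairs

chunks = list(split_pairs(candidate_pairs, chunk_size=4))
print([len(chunk) for chunk in chunks])  # [4, 4, 1]
```

Each chunk can then be compared separately and the results combined afterwards, which keeps peak memory bounded by the chunk size rather than by the total number of pairs.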

v0.10.1

6 years ago
  • Remove print statement in the geo compare algorithm.
  • String, numeric and geo compare functions now raise directly when an incorrect algorithm name is passed.
  • Fix unit test that failed on Python 2.7.
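Raising directly on an incorrect algorithm name means a typo is reported when the comparison is configured, not after an expensive computation has started. A minimal sketch of that fail-fast check follows, combined with the kind of abbreviation ('jw') that v0.11.0 introduced. The tables and the function `resolve_method` are hypothetical and written for illustration; the canonical name 'jarowinkler' is an assumption, and the library's own set of method names is larger.

```python
# Hypothetical tables for this sketch; only the 'jw' abbreviation and the
# names 'smith_waterman' and 'lcs' appear in these release notes.
ALIASES = {"jw": "jarowinkler"}
KNOWN_METHODS = {"jarowinkler", "smith_waterman", "lcs"}


def resolve_method(name):
    """Map an abbreviation to its full method name and reject unknown names."""
    canonical = ALIASES.get(name, name)
    if canonical not in KNOWN_METHODS:
        raise ValueError(f"unknown string comparison method: {name!r}")
    return canonical


print(resolve_method("jw"))   # abbreviation is expanded
print(resolve_method("lcs"))  # full name passes through unchanged

try:
    resolve_method("levenstein")  # misspelled name fails immediately
except ValueError as err:
    print(err)
```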

v0.10.0

6 years ago
  • A new compare API. The new Compare class no longer takes the datasets and pairs as arguments. The actual computation is now performed when calling .compute(PAIRS, DF1, DF2). The documentation is updated as well, but still needs improvement.
  • Two new string similarity measures are added: Smith-Waterman (smith_waterman) and Longest Common Substring (lcs). Thanks to Joel Becker and Jillian Anderson from the Networks Lab of the University of Waterloo.
  • Added and/or updated a large number of unit tests.
  • Various small improvements.
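For readers unfamiliar with the Longest Common Substring measure added in v0.10.0, the idea can be sketched with a textbook dynamic-programming routine. This is a simplified illustration, not the recordlinkage implementation: the function names are hypothetical, and dividing by the length of the longer string is an assumed normalisation that may differ from what the library does.

```python
def longest_common_substring(s1, s2):
    """Return the length of the longest contiguous substring shared by s1 and s2."""
    best = 0
    # prev[j] holds the length of the common suffix ending at the previous
    # character of s1 and the j-th character of s2.
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0] * (len(s2) + 1)
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lcs_similarity(s1, s2):
    """Scale the longest common substring length to [0, 1] by the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return longest_common_substring(s1, s2) / longest


# "nathan" is the longest shared run, so the length is 6.
print(longest_common_substring("jonathan", "johnathan"))
print(round(lcs_similarity("jonathan", "johnathan"), 3))
```

A single inserted letter cuts the shared run short here, which is why measures of this family are often paired with edit-distance based ones when names contain typos.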