Recordlinkage Versions

A powerful and modular toolkit for record linkage and duplicate detection in Python

v0.9.0

6 years ago
  • A new index API. The new index API is no longer a single class (recordlinkage.Pairs(...)) with all the functionality in it. The new API is based on Tensorflow and FEBRL. With the new structure, it is easier to parallelise the record linkage process. In future releases, this will be implemented natively. See the reference page for more information and for migrating: http://recordlinkage.readthedocs.io/en/latest/ref-index.html
  • Significant speed improvement of the Sorted Neighbourhood Indexing algorithm. Thanks to @perryvais (PR #32).
  • The function binary_comparisons is renamed. The new name of the function is binary_vectors. Documentation added to RTD.
  • Added unit tests to test the generation of random comparison vectors.
  • Logging module added to separate module logs from user logs. The implementation is based on Tensorflow.
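
The separation of module logs from user logs mentioned above can be illustrated with Python's standard logging package. This is only a minimal sketch of the general idea, not the package's actual logging module: the logger names "recordlinkage" and "my_app" are assumptions chosen for illustration.

```python
import logging

# A dedicated, named logger for the library. The name "recordlinkage"
# is an assumption made for this sketch.
module_logger = logging.getLogger("recordlinkage")
module_logger.setLevel(logging.WARNING)  # library only reports warnings and up

# The user's own application logger (hypothetical name) is configured
# independently, so changing one level does not affect the other.
user_logger = logging.getLogger("my_app")
user_logger.setLevel(logging.INFO)

# Each logger applies its own threshold.
module_info_enabled = module_logger.isEnabledFor(logging.INFO)
user_info_enabled = user_logger.isEnabledFor(logging.INFO)
```

Because the two loggers have different names, a user can silence or raise the verbosity of the library without touching the log output of their own code.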

v0.8.1

7 years ago
  • Solved issues with rendering the docs on ReadTheDocs. It is still not clear what is going on with autodoc_mock_imports in the Sphinx conf.py file. It may be a bug in Sphinx.
  • Move six to dependencies.
  • The reference part of the docs is split into separate subsections. This makes the reference more readable.
  • The landing page of the docs is slightly changed.

v0.8.0

7 years ago
  • Add additional arguments to the function that downloads and loads the krebsregister data. The argument missing_values is used to fill missing values. Default: nothing is done. The argument shuffle is used to shuffle the records. Default is True.
  • Remove the last traces of the old package name. The new package name is 'Python Record Linkage Toolkit'.
  • Better error messages when only matches or only non-matches are passed to train the classifier.
  • Add AirSpeedVelocity tests to test the performance.
  • Compare for deduplication fixed. It was broken.
  • Parameterized tests for the Compare class and its algorithms. Making use of nose-parameterized module.
  • Update documentation about contributing.
  • Bugfix/improvement when blocking on multiple columns with missing values.
  • Fix bug #29. Package not working with pandas 0.18 and 0.17. Dropped support for pandas 0.17 and fixed support for 0.18. Also added multi-dependency tests for TravisCI.
  • Support for dedicated deduplication algorithms
  • Special algorithm for full index in case of finding duplicates. Performance is 100x better.
  • Function max_number_of_pairs to get the maximum number of pairs.
  • low_memory for compare class.
  • Improved performance in case of comparing a large number of record pairs.
  • New documentation about custom algorithms
  • New documentation about the use of classifiers.
  • Possible to compare arrays and series directly without using labels.
  • Make a dataframe with random comparison vectors with the binary_comparisons function in the recordlinkage.datasets.random module.
  • Set KMeans cluster centers by hand.
  • Various documentation updates and improvements.
  • Jellyfish is now a required dependency. Fixes bug #30.
  • Added tox.ini to test packaging and installation of package.
  • Drop requirements.txt file.
  • Many small fixes and changes. Most of the changes cover the Compare module. Especially label handling is improved.
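
To make the full-index and maximum-number-of-pairs items above concrete, here is a minimal sketch of the arithmetic involved. The helper names below are hypothetical and are not the signature of the package's max_number_of_pairs function. The sketch assumes the usual definitions: linking two datasets compares every record of one with every record of the other, while deduplication compares each unordered pair within one dataset once.

```python
from itertools import combinations, product


def max_pairs_linking(n_a, n_b):
    """Upper bound on candidate pairs when linking two datasets."""
    return n_a * n_b


def max_pairs_dedup(n):
    """Upper bound on candidate pairs when deduplicating one dataset.

    Each unordered pair is counted once and a record is never paired
    with itself, which gives n * (n - 1) / 2.
    """
    return n * (n - 1) // 2


# Hypothetical record identifiers, for illustration only.
records_a = ["a1", "a2", "a3", "a4"]
records_b = ["b1", "b2"]

# Full index for linking: the Cartesian product of both datasets.
link_pairs = list(product(records_a, records_b))

# Full index for deduplication: every unordered pair within one dataset.
dedup_pairs = list(combinations(records_a, 2))
```

The deduplication case yields fewer than half as many pairs as the naive product of a dataset with itself, which is one reason a dedicated algorithm can pay off.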

v0.7.2

7 years ago

v0.7.1

7 years ago

v0.6.0

7 years ago

This version includes the following updates:

  • Reformatting the code such that it follows PEP8.
  • Add Travis-CI and codecov support.
  • Switch to distributing wheels.
  • Fix bugs with deprecated pandas functions. __sub__ is no longer used for computing the difference of Index objects. It is now replaced by INDEX.difference(OTHER_INDEX).
  • Exclude pairs with NaN's on the index-key in Q-gram indexing.
  • Add tests for krebsregister dataset.
  • Fix Python3 bug on krebsregister dataset.
  • Improve unicode handling in phonetic encoding functions.
  • Strip accents with the clean function.
  • Add documentation
  • Bug for random indexing with incorrect arguments fixed and tests added.
  • Improved deployment workflow
  • And much more
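
The pandas change mentioned in the list above (Index.difference instead of subtraction) looks like this in practice. This is a small sketch with made-up record identifiers, not code taken from the package.

```python
import pandas as pd

# Two hypothetical sets of record identifiers.
index_a = pd.Index(["rec-1", "rec-2", "rec-3", "rec-4"])
index_b = pd.Index(["rec-2", "rec-4"])

# Set difference: labels that are in index_a but not in index_b.
# This replaces the older `index_a - index_b` spelling.
only_in_a = index_a.difference(index_b)

print(list(only_in_a))
```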

v0.5.0

7 years ago
  • Batch comparing added. Significant speed improvement.
  • rldatasets are now included in the package itself.
  • Added an experimental gender imputation tool.
  • Blocking and SNI skip missing values
  • Different index names are no longer needed
  • FEBRL datasets included
  • Unit tests for indexing and comparing improved
  • Documentation updated

v0.4.0

7 years ago
  • Fixes a serious bug with deduplication (thanks to https://github.com/dserban).
  • Fixes undesired behaviour for sorted neighbourhood indexing with missing values.
  • Add new datasets to the package like Febrl datasets
  • Move Krebsregister dataset to this package.
  • Improve and add some tests
  • Various documentation updates

v0.3.1

7 years ago

v0.3

7 years ago

This version contains a lot of changes to the API. Hopefully, there are no large API changes needed for now.

  • Total restructure of compare functions (the end of the API changes is near).
  • Compare method numerical is now named numeric and fuzzy is now named string.
  • Add haversine formula to compare geographical records.
  • Use numexpr for computing numeric comparisons.
  • Add step, linear and squared comparing.
  • Add eye index method.
  • Improve, update and add new tests.
  • Remove iterative indexing functions.
  • New: chunks for indexing functions. These chunks are defined in the class Pairs. If chunks are defined, then the indexing functions return a generator with an Index for each element.
  • Update documentation.
  • Various bug fixes.
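
For readers unfamiliar with the haversine formula mentioned in the list above, the following is a minimal standard-library sketch of the formula itself. It is not the package's implementation: the function name, the kilometre unit, and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points.

    Latitudes and longitudes are given in decimal degrees.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)

    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A distance computed this way could then be turned into a similarity score, for example with a step, linear or squared function of the distance.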