Resources for tackling record linkage / deduplication / data matching problems
Resources for tackling record linkage (also known as deduplication, data matching, entity resolution)
Note: If you're looking for file deduplication software, you're in the wrong place! This page focuses on deduplicating datasets.
Also note: This page is not about deduplication software used in backup and storage, either.
Record linkage attempts to identify duplicate records in messy data. It is a thorny problem that crops up wherever an organization works with data about real-world entities (most often people): census and statistical bureaus, medical organizations, the social sciences, and of course commercial business.
For example, are these records the same person? Record linkage is how you make the computer decide, and decide quickly.
Name | Address | Phone |
---|---|---|
Bill Smith | 123 N. Main St. | 555-1235 |
Smith, William K. | 123 Main | - |
W. K. Smith | North Main Street | 222-555-1234 |
Bill Schmidt | 1230 Main St. | 542-1235 |
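
To make the question concrete, here is a minimal sketch of pairwise comparison over the four records above, using only Python's standard library. The helper names (`normalize`, `field_similarity`, `record_similarity`) and the field weights are assumptions made up for illustration; they are not taken from any tool listed on this page, and real record linkage systems use far more careful comparison and scoring.

```python
import re
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    return " ".join(cleaned.split())


def field_similarity(a, b):
    """Return a 0..1 similarity for two field values.

    A value that is empty after normalization (such as "-") scores 0.0.
    That is a simplification: it treats "missing" as "different".
    """
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(rec_a, rec_b, weights):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    score = sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in weights.items()
    )
    return score / total


records = [
    {"name": "Bill Smith", "address": "123 N. Main St.", "phone": "555-1235"},
    {"name": "Smith, William K.", "address": "123 Main", "phone": "-"},
    {"name": "W. K. Smith", "address": "North Main Street", "phone": "222-555-1234"},
    {"name": "Bill Schmidt", "address": "1230 Main St.", "phone": "542-1235"},
]

# Illustrative weights only; picking (or learning) them is part of the problem.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

# Score every pair of records.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = record_similarity(records[i], records[j], WEIGHTS)
        print(f"{records[i]['name']!r} vs {records[j]['name']!r}: {score:.2f}")
```

Plain string similarity like this does not know that "Bill" is a nickname for "William", and scoring every pair grows quadratically with the number of records. Those two gaps are a large part of why record linkage is hard.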
(last updated, stars)
Suggestions / contributions welcome! I am not an expert on record linkage; this is simply a list of things I found while working on a difficult deduplication problem for Thicket.io.