DTCleaner: data cleaning using multi-target decision trees.
It has been recognized that poor data quality can have multiple negative impact to enterprises [1]. Businesses operating on dirty data are in risk of causing large amount of financial loses. Maintaining data quality can also increases operational cost as business would need to spend time and resources to detect erroneous data and correct them. As data grows bigger these days, data repairing has became an important problem and an important research area.
DTCleaner produces multi-target decision trees for the purpose of data cleaning. It's built for detecting erroneous tuples in the dataset based on given set of conditional functional dependencies (CFDs) and building a classification model to predict erroneous tuples such that the "cleaned" dataset satisfies the CFDs, and semantically correct.
Consider the following schema:
hosp(ProviderNum, HospName, Addr, City, State, ZIP, County, Phone, HospType,
HospOwner, EmergencySerivce, Condition, MeasureName, StateAvg).
The data of this schema was taken from the US Department of Health & Human Services website. Here a hosp tuple contains 14 values attributes describing provider- level data for measures of different care ( heart attack care, heart failure care, surgical care, ...) and the following conditional functional dependencies (CFDs) used to detect erroneous tuples
CFD1 : hosp([Zip = 36545] -> [City = Jackson])
CFD2 : hosp([Zip = 94110] -> [City = San Francisco])
where CFD1 (resp. CFD2) asserts that if the zip code is 94115 (resp. 36545), then the city name must be San Francisco (resp. Jackson).
... | HospName | Addr | City | State | Zip | ... |
---|---|---|---|---|---|---|
... | Jackson Medical Ctr | 220 Hospital Drive | Jackson | AL | 36545 | |
... | Jackson Medical Ctr | 220 Hospital Drive | Jakson | AL | 36545 | |
... | SanFran Hospital | 1001 Potrero Ave | San Francisco | CA | 94110 | |
... | Cali Pacific Medical Ctr | 3555 Cesar St | San Fran | CA | 94110 |
Consider tuple 2 and 4. Tuple 2 (resp. 4) satisfies the premise of CFD1 (resp. CFD2) but they disagree in the right hand side (RHS) values (the city is misspelled in the case of tuple 2 and shortened in the case of tuple 4). At this point we know that the entries in t2[City], t2[Zip], t4[City], t4[Zip] are dirty and need to be cleaned. The problem is more complicated than simply changing the values of city to match the RHS values of the CFD, because we are unsure which attribute is the wrong one (zip code, city, or maybe both) in the first place!
The system takes in the following inputs:
The system would first perform the CFD violating detection, and separates the CFD-violating-tuples and inserts them in our test set. We then end up with a clean (non-CFD-violating tuples) set that we would use for training our model, and set of violating tuples that we will use for making predictions.
It's worth noting that DTCleaner focuses more on what happens after you acquire a set of valid and consistency set of CFDs by the user (this could be from experts with enough knowledge on the bussiness logic, or using a CFD discovery algrothim), and how to repair violating tuples so that we are closer to a dataset that is consistent with our CFDs. The repairing part (determining what attribute to change and what's the new value) is done with machine learning.
Make sure the following libraries are in your build path.
guava-18.0.jar
weka-src.jar
weka.jar
Usage: DTCleaner <input.arff> <CFDinput>
Eample: DTCleaner data/hospitalFewerAttr20PercentNoiseOn4.arff.arff data/CFDs
git checkout -b my-new-feature
git commit -am 'Add some feature'
git push origin my-new-feature
@misc{mustafa2015dtcleaner,
title = {DTCleaner for data quality},
author = {Abualsaud, Mustafa},
url = {https://github.com/AmmsA/DTCleaner},
year = {2015}
}
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
[1] Thomas C. Redman. The impact of poor data quality on the typical enterprise. Commun. ACM, 41(2):79{82, February 1998.