Supplementary material for IJCNN paper "XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning"
Zhao, Y. and Hryniewicki, M.K., "XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning," International Joint Conference on Neural Networks (IJCNN), IEEE, 2018.
Please cite the paper as:
@inproceedings{zhao2018xgbod,
title={XGBOD: improving supervised outlier detection with unsupervised representation learning},
author={Zhao, Yue and Hryniewicki, Maciej K},
booktitle={2018 International Joint Conference on Neural Networks (IJCNN)},
pages={1--8},
year={2018},
organization={IEEE}
}
PDF | IEEE Xplore | API Documentation | Example with PyOD
Update (Dec 25th, 2018): XGBOD has been officially released in Python Outlier Detection (PyOD) V0.6.6.
Update (Dec 6th, 2018): XGBOD has been implemented in Python Outlier Detection (PyOD), to be released in PyOD V0.6.6.
Additional notes:
XGBOD is a three-phase framework (see Figure below). In the first phase, it generates new data representations. Specifically, various unsupervised outlier detection methods are applied to the original data to get transformed outlier scores as new data representations. In the second phase, a selection process is performed on newly generated outlier scores to keep the useful ones. The selected outlier scores are then combined with the original features to become the new feature space. Finally, an XGBoost model is trained on the new feature space, and its output decides the outlier prediction result.
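The three phases above can be sketched end to end. This is a minimal illustration, not the repository's actual code: it uses a k-NN distance score as the unsupervised detector, skips the selection phase, and substitutes scikit-learn's GradientBoostingClassifier for XGBoost; all names and the synthetic data are assumptions for demonstration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import GradientBoostingClassifier

def knn_outlier_scores(X, k):
    """Unsupervised TOS: distance to the k-th nearest neighbor."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dist[:, -1]  # first column is the point itself (distance 0)

rng = np.random.RandomState(42)
X = rng.randn(200, 5)                      # synthetic data for illustration
y = (rng.rand(200) < 0.1).astype(int)      # synthetic outlier labels

# Phase 1: generate transformed outlier scores (TOS) with several k values
tos = np.column_stack([knn_outlier_scores(X, k) for k in (1, 5, 10)])

# Phase 2: (selection omitted here) append TOS to the original features
X_new = np.hstack([X, tos])

# Phase 3: train a supervised booster on the augmented feature space
clf = GradientBoostingClassifier().fit(X_new, y)
scores = clf.predict_proba(X_new)[:, 1]    # outlier probability per sample
```

In the paper, phase 1 uses multiple detector families (kNN, LOF, one-class SVM, etc.) rather than a single scorer, and phase 3 uses XGBoost proper.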
The experiment code is written in Python 3 and built on a number of Python packages:
Batch installation is possible using the supplied "requirements.txt":
pip install -r requirements.txt
Seven datasets are used (see dataset folder):
Datasets | Sample Size | Dimension | Number of Outliers |
---|---|---|---|
Arrhythmia | 351 | 274 | 126 (36%) |
Letter | 1600 | 32 | 100 (6.25%) |
Cardio | 1831 | 21 | 176 (9.6%) |
Speech | 3686 | 600 | 61 (1.65%) |
Satellite | 6435 | 36 | 2036 (31.64%) |
Mnist | 7603 | 100 | 700 (9.21%) |
Mammography | 11863 | 6 | 260 (2.32%) |
All datasets are accessible at http://odds.cs.stonybrook.edu/. For citing the datasets, please refer to:
Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.
Experiments can be reproduced by running xgbod_demo.py directly. Simply download/clone the entire repository and execute the code with "python xgbod_demo.py".
The first part of the code reads in the datasets using SciPy. Seven public outlier datasets are supplied. Then various transformed outlier scores (TOS) are built by seven different algorithms.
Taking KNN as an example, the code is shown below:
# Generate TOS using KNN based algorithms
feature_list, roc_knn, prc_n_knn, result_knn = get_TOS_knn(X_norm, y, k_range, feature_list)
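A helper like get_TOS_knn might look roughly as follows. This is a hypothetical re-implementation, not the repository's code: it builds one TOS column per k in k_range and records each score's individual ROC-AUC against the ground-truth labels.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

def get_tos_knn(X, y, k_range):
    """One TOS column per k: distance to the k-th nearest neighbor,
    each scored by its standalone ROC-AUC (hypothetical sketch)."""
    cols, rocs = [], []
    for k in k_range:
        dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        score = dist[:, -1]          # k-th neighbor distance, self excluded
        cols.append(score)
        rocs.append(roc_auc_score(y, score))
    return np.column_stack(cols), rocs

# Synthetic usage example
X = np.random.RandomState(0).randn(150, 4)
y = np.zeros(150, dtype=int)
y[:15] = 1                           # 10% labeled outliers, for illustration
tos, rocs = get_tos_knn(X, y, k_range=(1, 3, 5))
```

The per-score ROC values are what the selection phase below consumes.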
Then three TOS selection methods are used to select p TOS:
p = 10 # number of selected TOS
# random selection
X_train_new_rand, X_train_all_rand = random_select(X, X_train_new_orig, roc_list, p)
# accurate selection
X_train_new_accu, X_train_all_accu = accurate_select(X, X_train_new_orig, feature_list, roc_list, p)
# balance selection
X_train_new_bal, X_train_all_bal = balance_select(X, X_train_new_orig, roc_list, p)
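The first two strategies can be sketched as follows; this is a simplified re-implementation under assumed signatures, not the repository's functions. Random selection keeps p TOS columns at random; accurate selection keeps the p columns whose standalone ROC-AUC is highest (balanced selection, which trades accuracy against detector diversity, is omitted here).

```python
import numpy as np

def random_select(tos, p, seed=0):
    """Keep p TOS columns chosen uniformly at random (sketch)."""
    idx = np.random.RandomState(seed).choice(tos.shape[1], p, replace=False)
    return tos[:, idx]

def accurate_select(tos, roc_list, p):
    """Keep the p TOS columns with the highest individual ROC-AUC (sketch)."""
    idx = np.argsort(roc_list)[::-1][:p]
    return tos[:, idx]

# Synthetic usage example: 20 candidate TOS with made-up ROC scores
tos = np.random.RandomState(1).rand(100, 20)
roc = np.random.RandomState(2).rand(20)
kept_rand = random_select(tos, p=10)
kept_accu = accurate_select(tos, roc, p=10)
```

In either case the kept columns are concatenated with the original features before the supervised stage.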
Finally, various classification methods are applied to the datasets. Sample outputs are provided below:
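The evaluation step might look roughly like this; a minimal sketch with synthetic data, using logistic regression as one of the baseline classifiers and average precision as a stand-in for the paper's precision@n metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.RandomState(0)
X_aug = rng.randn(300, 15)           # augmented feature space (synthetic)
y = np.zeros(300, dtype=int)
y[::10] = 1                          # 10% labeled outliers, for illustration

clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
prob = clf.predict_proba(X_aug)[:, 1]

auc = roc_auc_score(y, prob)                 # ranking quality
ap = average_precision_score(y, prob)        # precision-oriented view
print("ROC-AUC:", auc, "| Average precision:", ap)
```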
Running plots.py generates the figures for the various TOS selection algorithms: