Data stream analytics: Implement online learning methods to address concept drift and model drift in data streams using the River library. Code for the paper entitled "PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams" published in IEEE GlobeCom 2021.
This is the code for the paper entitled "PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams" published in 2021 IEEE Global Communications Conference (GLOBECOM), doi: 10.1109/GLOBECOM46510.2021.9685338.
Authors: Li Yang, Dimitrios Michael Manias, and Abdallah Shami
Organization: The Optimized Computing and Communications (OC2) Lab, ECE Department, Western University
This repository also introduces concept drift definitions and online machine learning methods for data stream analytics using the River library.
Complete tutorial code for an end-to-end pipeline covering concept drift, online machine learning, and data stream analytics, including dynamic data pre-processing, drift-based dynamic feature selection, dynamic model learning & selection, and online ensemble models, is available in: MSANA-Online-Data-Stream-Analytics-And-Concept-Drift-Adaptation
Additional tutorial code for concept drift, online machine learning, and data stream analytics is available in: OASW-Concept-Drift-Detection-and-Adaptation
In non-stationary and dynamic environments, such as IoT environments, the distribution of input data often changes over time, a phenomenon known as concept drift. Concept drift degrades the performance of the currently trained data analytics model. Traditional offline machine learning (ML) models cannot deal with concept drift, making it necessary to develop online adaptive analytics models that can adapt to both predictable and unpredictable changes in data streams.
To address concept drift, effective methods should be able to detect concept drift and adapt to the changes accordingly. Therefore, concept drift detection and adaptation are the two major steps for online learning on data streams.
Adaptive Windowing (ADWIN) is a distribution-based method that uses an adaptive sliding window to detect concept drift based on data distribution changes. ADWIN identifies concept drift by calculating and analyzing the average of certain statistics over the two sub-windows of the adaptive window. The occurrence of concept drift is indicated by a large difference between the averages of the two sub-windows. Once a drift point is detected, all the old data samples before that drift time point are discarded.
from river.drift import ADWIN
adwin = ADWIN()
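To make the two-sub-window idea concrete, here is a minimal pure-Python sketch. It is not River's implementation: it checks every split of a plain list instead of using ADWIN's compressed buckets, and it assumes a Hoeffding-style cut threshold with a hypothetical `delta` confidence parameter. The names `find_cut` and `detect_drifts` are illustrative only.

```python
import math


def find_cut(window, delta=0.002):
    """Return the first split index whose two sub-window means differ by
    more than a Hoeffding-style threshold, or None if no split qualifies."""
    n = len(window)
    if n < 2:
        return None
    total = sum(window)
    head_sum = 0.0
    log_term = math.log(4.0 * n / delta)  # ln(4 / delta'), with delta' = delta / n
    for i in range(1, n):
        head_sum += window[i - 1]
        n0, n1 = i, n - i
        mean0 = head_sum / n0              # average of the older sub-window
        mean1 = (total - head_sum) / n1    # average of the newer sub-window
        m = 1.0 / (1.0 / n0 + 1.0 / n1)    # size term combining n0 and n1
        eps_cut = math.sqrt(log_term / (2.0 * m))
        if abs(mean0 - mean1) > eps_cut:
            return i
    return None


def detect_drifts(stream, delta=0.002):
    """Feed values one by one and return the positions where a cut occurred."""
    window = []
    drift_points = []
    for t, x in enumerate(stream):
        window.append(x)
        cut = find_cut(window, delta)
        if cut is not None:
            drift_points.append(t)
            window = window[cut:]  # discard the samples older than the cut
    return drift_points


# Monitored statistic (e.g. a 0/1 error indicator) that jumps at t = 200.
stream = [0.0] * 200 + [1.0] * 200
drift_points = detect_drifts(stream)
print(drift_points)
```

The detector stays silent on the stationary prefix and reports a drift a few samples after the change, once the newer sub-window holds enough evidence. This version costs O(n) per update, which is why practical implementations compress the window.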
Drift Detection Method (DDM) is a popular model performance-based method that defines two thresholds, a warning level and a drift level, and monitors changes in the model's error rate and its standard deviation to detect drift.
from river.drift import DDM
ddm = DDM()
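The warning and drift levels can be illustrated with a small self-contained sketch. This is a simplified reading of DDM, not River's code: it assumes the commonly used rule that a warning is raised when the error rate plus its standard deviation exceeds the recorded minimum by 2 standard deviations, and a drift when it exceeds it by 3. The class name `SimpleDDM` and its parameters are hypothetical.

```python
import math


class SimpleDDM:
    """Minimal DDM-style detector over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0           # running error rate
        self.p_min = math.inf  # error rate at the best point seen so far
        self.s_min = math.inf  # standard deviation at that point

    def update(self, error):
        """Consume one error indicator (1 = misclassified) and return a state."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(max(0.0, self.p * (1.0 - self.p)) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()  # start monitoring the new concept from scratch
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# A model that is wrong 10% of the time, then wrong on every sample.
errors = [1 if i % 10 == 9 else 0 for i in range(300)] + [1] * 50
detector = SimpleDDM()
events = []
for t, e in enumerate(errors):
    state = detector.update(e)
    if state != "stable":
        events.append((t, state))
print(events)
```

On this synthetic stream the detector passes through the warning state before signalling drift, which is the window in which an adaptive system would typically start preparing a replacement model.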
Hoeffding tree (HT) is a type of decision tree (DT) that uses the Hoeffding bound to incrementally adapt to data streams. Unlike a conventional DT, which chooses the best split from the full training set, the HT uses the Hoeffding bound to determine how many samples are necessary before a split node can be selected. Thus, the HT can update its nodes to adapt to newly incoming samples.
from river import tree
model = tree.HoeffdingTreeClassifier(
    grace_period=100,
    split_confidence=1e-5,
    # ...
)
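For intuition on how the Hoeffding bound relates sample count to split confidence, the sketch below evaluates the standard bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). It assumes that `delta` plays the role of `split_confidence` above and that the metric range `R` is 1 (e.g. information gain for two classes); both mappings are assumptions made for illustration, and the helper names are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon after n observations of a metric with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range, delta, gain_gap):
    """Smallest n for which the bound is at or below the observed gain gap."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * gain_gap ** 2))


# Bound after one grace period of 100 samples with delta = 1e-5.
eps_100 = hoeffding_bound(1.0, 1e-5, 100)

# Samples required before a 0.1 gap between the two best splits is trusted.
n_required = samples_needed(1.0, 1e-5, 0.1)
print(round(eps_100, 4), n_required)
```

A smaller `delta` or a narrower gap between candidate splits both increase the number of samples the tree waits for before committing to a split.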
Extremely Fast Decision Tree (EFDT), also named Hoeffding Anytime Tree (HATT), is an improved version of the HT that splits a node as soon as a split reaches the confidence level, instead of waiting until the best split has been identified as the HT does.
from river import tree
model = tree.ExtremelyFastDecisionTreeClassifier(
    grace_period=100,
    split_confidence=1e-5,
    min_samples_reevaluate=100,
    # ...
)
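The difference between the two split rules can be sketched in a few lines. This assumes the usual formulation: the HT splits when the best candidate beats the runner-up by more than the Hoeffding bound, while the EFDT splits when the best candidate beats not splitting at all by more than the bound. Tie-breaking and the later re-evaluation of splits are omitted, and the function names and gain values are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def ht_should_split(gains, eps):
    """HT rule: the best candidate must beat the runner-up by more than eps."""
    ranked = sorted(gains.values(), reverse=True)
    return len(ranked) >= 2 and ranked[0] - ranked[1] > eps


def efdt_should_split(gains, no_split_gain, eps):
    """EFDT rule: the best candidate must beat not splitting by more than eps."""
    return max(gains.values()) - no_split_gain > eps


# Two candidate features with similar merit, observed over 200 samples.
gains = {"feature_a": 0.30, "feature_b": 0.27}
eps = hoeffding_bound(1.0, 1e-5, 200)

ht_decision = ht_should_split(gains, eps)
efdt_decision = efdt_should_split(gains, no_split_gain=0.0, eps=eps)
print(ht_decision, efdt_decision)
```

With two nearly equivalent candidates, the HT keeps waiting for more samples, whereas the EFDT splits immediately on the currently best feature.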
The adaptive random forest (ARF) algorithm uses HTs as base learners and ADWIN as the drift detector for each tree to address concept drift. Through the drift detection process, poorly performing base trees are replaced by new trees that fit the new concept.
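from river import ensemble
from river.drift import ADWIN
# Note: recent River releases may expose this class as forest.ARFClassifier.
model = ensemble.AdaptiveRandomForestClassifier(
    n_models=3,
    drift_detector=ADWIN(),
    # ...
)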
Streaming Random Patches (SRP) uses an approach similar to ARF, but it applies a global subspace randomization strategy instead of the local subspace randomization technique used by ARF. Global subspace randomization is a more flexible method that improves the diversity of base learners.
from river import ensemble
from river import tree
from river.drift import ADWIN
base_model = tree.HoeffdingTreeClassifier(
    grace_period=50, split_confidence=0.01,
    # ...
)
model = ensemble.SRPClassifier(
    model=base_model, n_models=3, drift_detector=ADWIN(),
    # ...
)
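The distinction between global and local subspace randomization can be shown with a short sketch. It assumes the usual reading: in the global strategy each base learner is assigned one random feature subset for all of its training, while in the local strategy a fresh random subset is drawn at every split attempt inside a tree. The feature names and helper functions are hypothetical.

```python
import random


def global_subspaces(features, n_models, subspace_size, seed=0):
    """Global strategy: one fixed random feature subset per base learner."""
    rng = random.Random(seed)
    return [sorted(rng.sample(features, subspace_size)) for _ in range(n_models)]


def local_subspace(features, subspace_size, rng):
    """Local strategy: a new random feature subset for each split attempt."""
    return sorted(rng.sample(features, subspace_size))


features = ["f1", "f2", "f3", "f4", "f5", "f6"]

# SRP-style: each of the 3 learners only ever sees its own 3 features.
per_learner = global_subspaces(features, n_models=3, subspace_size=3)

# ARF-style: one tree draws a different candidate subset at each split attempt.
rng = random.Random(1)
per_split = [local_subspace(features, 3, rng) for _ in range(4)]

print(per_learner)
print(per_split)
```

Because the global subset is chosen outside the learner, this strategy is not tied to tree-based models and can wrap any base learner.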
Leveraging bagging (LB) is another popular online ensemble that uses bootstrap samples to construct base learners. It uses the Poisson distribution to increase data diversity and leverage the bagging performance.
from river import ensemble
from river import linear_model
from river import preprocessing
model = ensemble.LeveragingBaggingClassifier(
    model=(
        preprocessing.StandardScaler() |
        linear_model.LogisticRegression()
    ),
    n_models=3,
    # ...
)
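The role of the Poisson distribution can be sketched as follows. In online bagging, each base learner trains on an incoming sample k times, with k drawn from Poisson(1); leveraging bagging is commonly described as raising the rate (6 is a typical default) so that each learner sees more heavily re-weighted data. The sampler below uses Knuth's method with the standard library only, and the function names and the rate of 6 are assumptions for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def training_weights(n_models, lam, rng):
    """How many times each base learner trains on the current sample."""
    return [poisson(lam, rng) for _ in range(n_models)]


rng = random.Random(42)

# Weights assigned to one incoming sample by a 3-model ensemble.
print(training_weights(3, lam=6.0, rng=rng))

# Empirical averages: online bagging (lam = 1) vs. leveraging bagging (lam = 6).
mean_lam1 = sum(poisson(1.0, rng) for _ in range(5000)) / 5000
mean_lam6 = sum(poisson(6.0, rng) for _ in range(5000)) / 5000
print(round(mean_lam1, 2), round(mean_lam6, 2))
```

A weight of 0 means the learner skips the sample entirely, which is what makes the base learners see different effective training sets.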
As the number of Internet of Things (IoT) devices and systems has surged, IoT data analytics techniques have been developed to detect malicious cyber-attacks and secure IoT systems; however, concept drift often occurs in IoT data analytics because IoT data usually takes the form of dynamic data streams that change over time, causing model degradation and attack detection failure. This is because traditional data analytics models are static and cannot adapt to data distribution changes. In this paper, we propose a Performance Weighted Probability Averaging Ensemble (PWPAE) framework for drift adaptive IoT anomaly detection through IoT data stream analytics. Experiments on two public datasets show the effectiveness of our proposed PWPAE method compared against state-of-the-art methods.
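As a rough illustration of performance weighted probability averaging, the sketch below combines the class probabilities of several base learners, weighting each learner by the reciprocal of its running error rate. The exact weighting is an assumption made here for illustration, as are the function name, the smoothing constant `eps`, and the example numbers; the paper and the notebooks in this repository are the authoritative description.

```python
def weighted_average_proba(probas, error_rates, eps=1e-3):
    """Average per-class probabilities, weighting each learner by 1 / (error + eps)."""
    weights = [1.0 / (e + eps) for e in error_rates]
    total = sum(weights)
    classes = set().union(*probas)
    return {
        c: sum(w * p.get(c, 0.0) for w, p in zip(weights, probas)) / total
        for c in classes
    }


# Hypothetical outputs of three base learners for one sample (0 = normal, 1 = attack).
probas = [
    {0: 0.8, 1: 0.2},
    {0: 0.3, 1: 0.7},
    {0: 0.3, 1: 0.7},
]
# Hypothetical running error rates of the same learners on recent data.
error_rates = [0.05, 0.30, 0.40]

combined = weighted_average_proba(probas, error_rates)
prediction = max(combined, key=combined.get)
print(combined, prediction)
```

In this example a plain average would favor class 1, but the most accurate learner dominates the weighted average and the ensemble predicts class 0. Since the error rates are updated as the stream evolves, the weights shift toward whichever learners handle the current concept best.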
IoTID20 dataset, a novel IoT botnet dataset
CICIDS2017 dataset, a popular network traffic dataset for intrusion detection problems
For the purpose of displaying the experimental results in Jupyter Notebook, the sampled subsets of the two datasets are used in the sample code. The subsets are in the "data" folder.
Please feel free to contact us for any questions or cooperation opportunities. We will be happy to help.
If you find this repository useful in your research, please cite this article as:
L. Yang, D. M. Manias and A. Shami, "PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams," 2021 IEEE Global Communications Conference (GLOBECOM), 2021, pp. 1-6, doi: 10.1109/GLOBECOM46510.2021.9685338.
@INPROCEEDINGS{9685338,
author={Yang, Li and Manias, Dimitrios Michael and Shami, Abdallah},
booktitle={2021 IEEE Global Communications Conference (GLOBECOM)},
title={PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams},
year={2021},
pages={1-6},
doi={10.1109/GLOBECOM46510.2021.9685338}
}