Implementation of MTAD-GAT: Multivariate Time-series Anomaly Detection via Graph Attention Network
Implementation of MTAD-GAT: Multivariate Time-series Anomaly Detection via Graph Attention Network
https://arxiv.org/pdf/2009.02040.pdf
Notes:
this can be trained in forecasting, reconstracting and both modes. Forecasting only gives good results.
run_mode flag to specify to train or predict in FORECASTING, RECONSTRUCTING or BOTH. if mode is trained in FORECASTING , it does not make any sense to predict other than FORECASTING and same about other modes!
clip gradients is set to 0.1 and learning rate 5e-6. You can start at 1e-5 and later reduce to 1e-6.
d3 is 18, it is kind of half of 38 features what is of SMD
reconstruction pdf taken and used for every time point in a time series for training. Inference - only last item is used
if reconstruction, anomaly log pdf is -reconstruction log pdf. it is not 1 - pdf
to calculate combined score , anomaly log pdf is calculated as explaned above. both forecasting square difference and reconstruction anomaly pdf are scaled to a range 0-1 and features are summed. gamma can be used to give more weight to forecasting or reconstracting. combined score can be used to show which feature caused anomaly the most.
final score is total of all feature scores. this is compared against threshold
train and test data is generated to GCP
models are stored to GCP
"anomaly_detection" is my GCP storage
I used some very minor utility content from BERT. This is why I put BERT Licence
I am discussing SMD only, but MSL and SMAP should be used as well
jupyter needs to be adjuster for concrete logs and paths
Future experiments:
bottleneck on VAE is 18 considering SMD 38 features. Different value can be experimented with. Paper specifies 300 bottleneck size, I didn't get it
training often gets into NaN cuased by backptopagation. I just restart in a loop. Gradient clipping and learning rate helps. May be some optimizer parameters could help further
Training steps:
Download ServerMachineDataset from https://github.com/NetManAIOps/OmniAnomaly Put ServerMachineDataset here at the root of the
Prepare tfrecords train and test data
python prepare_data.py --files_path=ServerMachineDataset/train --tfrecords_file=gs://anomaly_detection/mtad_tf/data/train/{}.tfrecords python prepare_data.py --files_path=ServerMachineDatasettest --label_path=ServerMachineDataset/test_label --tfrecords_file=gs://anomaly_detection/mtad_tf/data/test/{}.tfrecords
See losses in samples folder. Use provuded jupyternotebook, but adjust field positions, paths.
for c in {1..100}; do echo $c; time python training.py --action=TRAIN --train_file=gs://anomaly_detection/mtad_gat/data/train/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_gat/output/machine-1-1-fore --run_mode=FORECASTING --num_train_steps=30000; done
for c in {1..100}; do echo $c; time python training.py --action=TRAIN --train_file=gs://anomaly_detection/mtad_gat/data/train/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_gat/output/machine-1-1-reco --run_mode=RECONSTRACTING --num_train_steps=300000; done
for c in {1..100}; do echo $c; time python training.py --action=TRAIN --train_file=gs://anomaly_detection/mtad_gat/data/train/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_gat/output/machine-1-1-both --run_mode=BOTH --num_train_steps=300000; done
See graphs in samples folder
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_gat/output/machine-1-1-fore --prediction_task=inference_score --run_mode=FORECASTING
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_gat/output/machine-1-1-fore --prediction_task=inference_score --run_mode=RECONSTRACTING
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_gat/output/machine-1-1-fore --prediction_task=inference_score --run_mode=BOTH
Use my another repository EVT_POT to calculate threshold. inference_score.csv is an input for that
Adjust and use jupyter notebook to grath inference_score.csv, initial threshold, POT and Best threshold. Adjust those values in the notebook
Anomaly Evaluation
I do not use Estimator EVAL to do this since I use an adjustment procedure. I use this same simple omnianomaly approach. If we predicted anomaly and it is within some anomaly segment, whole segment becomes correctly predicted. I do not use a prediction latency (delta) value as in some other papers. Also, I think practically, if we flagged anomaly earlier compare to actual anomaly label, it could be considered hit maybe using some delta as well. I did not do this way here. Please see samples folder for evaluation metrics for machine-1-1-{fore|reco|both}.results in samples folder
You must provide your own different threshold values
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_tf/output/machine-1-1-fore --threshold=1.5547085 --prediction_task=EVALUATE --run_mode=FORECASTING
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_tf/output/machine-1-1-reco --threshold=1.5547085 --prediction_task=EVALUATE --run_mode=RECONSTRACTING
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_tf/output/machine-1-1-both --threshold=1.5547085 --prediction_task=EVALUATE --run_mode=BOTH
Anomaly Prediction
This command creates Anomaly.csv in the local folder. You must provide your threshold values
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_tf/output/machine-1-1-fore --threshold=1.5547085 --run_mode=FORECASTING
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_tf/output/machine-1-1-fore --threshold=1.5547085 --run_mode=RECONSTRACTING
python training.py --action=PREDICT --test_file=gs://anomaly_detection/mtad_gat/data/test/machine-1-1.tfrecords --output_dir=gs://anomaly_detection/mtad_tf/output/machine-1-1-fore --threshold=1.5547085 --run_mode=BOTH