Repository of KDD Cup, 2018.
The data is preprocessed and then split into training, validation and aggregation sets.
Data preprocessing
Steps of data preprocessing:
Split the data
All data points that remain valid after preprocessing are split into 3 parts: training set, validation set and aggregation set.
The training set is used to train the single models; data from 20170101-20180328 is usually used for it.
The validation set is used to select the best single models from the checkpoints of all single models. The best single models are then aggregated on the validation set and finally evaluated on the aggregation set. The aggregation model is used for the final prediction.
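As a rough illustration of the split described above, the sketch below partitions timestamped records by date. Only the training range (20170101-20180328) comes from the text; the validation boundary, the function name `split_by_date` and the `(timestamp, value)` record layout are assumptions made for this example.

```python
from datetime import datetime

# Training range taken from the text; VAL_END is a hypothetical boundary.
TRAIN_START = datetime(2017, 1, 1)
TRAIN_END = datetime(2018, 3, 29)   # exclusive, i.e. up to the end of 20180328
VAL_END = datetime(2018, 4, 15)     # exclusive, assumed (not from the source)


def split_by_date(records):
    """Split (timestamp, value) records into training, validation and aggregation sets."""
    train, val, aggr = [], [], []
    for ts, value in records:
        if ts < TRAIN_START:
            continue  # older than the range used in this sketch
        if ts < TRAIN_END:
            train.append((ts, value))
        elif ts < VAL_END:
            val.append((ts, value))
        else:
            aggr.append((ts, value))
    return train, val, aggr


records = [
    (datetime(2017, 6, 1, 12), 35.0),
    (datetime(2018, 4, 2, 8), 80.0),
    (datetime(2018, 4, 20, 20), 12.0),
]
train, val, aggr = split_by_date(records)
print(len(train), len(val), len(aggr))
```

Splitting by time rather than at random keeps later data out of the training set, which matters for a forecasting task.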
Why oversampling?
Symmetric mean absolute percentage error (SMAPE) is used as the evaluation metric in this competition. In SMAPE, relative error matters rather than absolute error: each absolute error is divided by the magnitude of the actual and predicted values.
However, the loss functions applied in the different models (L1 loss, L2 loss and Huber loss) all aim at decreasing absolute error rather than relative error. So if models are trained on the original data with these 3 loss functions, they are optimized to fit data points with large values rather than data points with small values, which leads to a larger SMAPE on both the validation set and the test set.
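To make the difference between relative and absolute error concrete, here is a minimal sketch that compares SMAPE with an L1 loss on a small and a large target. SMAPE has several variants; the form below (absolute error divided by the mean of the absolute actual and predicted values) and the zero-denominator convention are assumptions, not necessarily the exact formula used by the competition.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in the range [0, 2]."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Assumed convention: a term counts as 0 when both values are 0.
        total += abs(f - a) / denom if denom else 0.0
    return total / len(actual)


def l1(actual, forecast):
    """Mean absolute error."""
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)


# The same absolute error of 5 on a small and on a large target.
small = smape([10.0], [15.0])    # 5 / 12.5
large = smape([200.0], [205.0])  # 5 / 202.5
print(small, large)
print(l1([10.0], [15.0]), l1([200.0], [205.0]))
```

The L1 loss treats both errors identically, while SMAPE penalizes the error on the small target far more heavily, which is the mismatch that oversampling tries to offset.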
Oversampling Strategies
Data from 20170101-20180328 is used as the training data. Oversampling steps are as follows:
Oversample_part and repeats are hyperparameters whose suitable values can be found by random search or grid search. Oversampling led to a 0.02~0.04 improvement in SMAPE on the validation set.
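The sketch below shows one possible reading of how the two hyperparameters could interact. It is an assumption for illustration only, not the procedure used in this repository: the `oversample_part` fraction of training samples with the smallest targets is appended `repeats` extra times. The function name and the `(features, target)` sample layout are hypothetical.

```python
def oversample_small_targets(samples, oversample_part=0.2, repeats=2):
    """Append `repeats` extra copies of the smallest-target samples.

    ASSUMED interpretation of the hyperparameters, not the original steps.
    Each sample is a (features, target) pair.
    """
    ordered = sorted(samples, key=lambda s: s[1])
    n_small = int(len(ordered) * oversample_part)
    return list(samples) + ordered[:n_small] * repeats


data = [("a", 5.0), ("b", 120.0), ("c", 8.0), ("d", 60.0), ("e", 300.0)]
augmented = oversample_small_targets(data, oversample_part=0.4, repeats=2)
print(len(augmented))
```

Under this reading, small-valued samples contribute more often to an absolute-error loss, nudging the model toward lower relative error on them.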
A seq2seq model is a machine learning model that uses an encoder and a decoder to learn sequential feature patterns from data. Seq2seq models are applied in many machine learning applications, especially NLP applications such as machine translation. In this project, seq2seq is applied to generate time series forecasts at two granularities: the Day model and the Hour model. The basic graph of the seq2seq model is as follows.
Day model
Air quality appears to follow a strong daily cycle, as shown in the 3rd part of bj_aq_data_exploration and below. So the basic seq2seq model is the Day model: it only predicts the mean value of each aq parameter for each of the next 2 days, and the 24-hour trend of each parameter is then overlaid on these means to generate the final prediction.
[Figure panels: PM2.5 | PM10 | O3 | NO2]
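The Day model idea (predict daily means, then overlay a 24-hour trend) might be sketched as follows. The multiplicative form of the overlay, the way the hourly profile is estimated, and the function names are assumptions for illustration; the predicted daily means are hard-coded here in place of the model output.

```python
def hourly_profile(history_days):
    """Average ratio of each hour's value to its day's mean.

    `history_days` is a list of days, each a list of 24 hourly values.
    """
    profile = [0.0] * 24
    for day in history_days:
        day_mean = sum(day) / 24
        for hour, value in enumerate(day):
            profile[hour] += value / day_mean
    return [p / len(history_days) for p in profile]


def overlay(daily_means, profile):
    """Expand predicted daily means into hourly values (multiplicative overlay, assumed)."""
    return [mean * ratio for mean in daily_means for ratio in profile]


# Toy history: two days with the same shape, lower in the first 12 hours.
day = [20.0] * 12 + [40.0] * 12
profile = hourly_profile([day, [2 * v for v in day]])

# Stand-in for the Day model's predicted means for the next 2 days.
forecast = overlay([30.0, 60.0], profile)
print(len(forecast))
```

The result contains 48 hourly values, 24 for each predicted day, each following the shape of the historical daily cycle.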
The computation graph of the Day model is as follows.
Hour model, predicting 2 days together
Hour model, predicting 1 day at a time