Python package for stacking (machine learning technique)
Python package for stacking (stacked generalization) featuring lightweight functional API and fully compatible scikit-learn API
Convenient way to automate OOF computation, prediction and bagging using any number of models
Scikit-learn API with familiar fit and transform methods
Compatible with sklearn.pipeline.Pipeline (FeatureUnion is also invited to the party)
Docstrings are available from the interpreter:
>>> help(stacking)
>>> help(StackingTransformer)
Note: Python 3.5 or higher is required. If you’re still using Python 2.7 or 3.4 see installation details here
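# Classic first-time installation
pip install vecstack
# Install for the current user only
pip install --user vecstack
# Use a specific Python interpreter (Linux/macOS)
/usr/bin/python -m pip install vecstack
# Use a specific Python interpreter (Windows)
C:/Python36/python -m pip install vecstack
# Upgrade vecstack and its dependencies
pip install --upgrade vecstack
# Upgrade vecstack without upgrading dependencies
pip install --upgrade --no-deps vecstack
# Upgrade directly from GitHub without upgrading dependencies
pip install --upgrade --no-deps https://github.com/vecxoz/vecstack/archive/master.zip
# Uninstall
pip uninstall vecstack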
from sklearn.linear_model import LinearRegression, Ridge
from vecstack import stacking
# Get your data
# Initialize 1st level estimators
models = [LinearRegression(),
          Ridge(random_state=0)]
# Get your stacked features in a single line
S_train, S_test = stacking(models, X_train, y_train, X_test, regression=True, verbose=2)
# Use 2nd level estimator with stacked features
from sklearn.linear_model import LinearRegression, Ridge
from vecstack import StackingTransformer
# Get your data
# Initialize 1st level estimators
estimators = [('lr', LinearRegression()),
              ('ridge', Ridge(random_state=0))]
# Initialize StackingTransformer
stack = StackingTransformer(estimators, regression=True, verbose=2)
# Fit
stack = stack.fit(X_train, y_train)
# Get your stacked features
S_train = stack.transform(X_train)
S_test = stack.transform(X_test)
# Use 2nd level estimator with stacked features
Functional API (stacking function) or Scikit-learn API (StackingTransformer)?
How do parameters of the stacking function and StackingTransformer correspond?
Can I use (Randomized)GridSearchCV to tune the whole stacking Pipeline?
How does random_state work?
Just open an issue here.
Ask me anything on the topic.
I'm a bit busy, so I typically answer the next day.
Just give me a star in the top right corner of the repository page.
@misc{vecstack2016,
author = {Igor Ivanov},
title = {Vecstack},
year = {2016},
publisher = {GitHub},
howpublished = {\url{https://github.com/vecxoz/vecstack}},
}
Stacking (stacked generalization) is a machine learning ensembling technique.
The main idea is to use predictions as features.
More specifically, we predict the train set (in CV-like fashion) and the test set using some 1st level model(s), and then use these predictions as features for a 2nd level model. You can find more details (concept, pictures, code) in the stacking tutorial.
Often it is also called stacked generalization. The term is derived from the verb to stack (to put together, to put on top of each other). It implies that we put some models on top of other models, i.e. train some models on predictions of other models. From another point of view we can say that we stack predictions in order to use them as features.
It depends on the specific business case. The main thing to know about stacking is that it requires significant computing resources. The No Free Lunch Theorem applies as always. Stacking can give you an improvement, but at a certain price (deployment, computation, maintenance). Only an experiment for the given business case will tell you whether it is worth the effort and money.
Currently a large share of stacking users are participants in machine learning competitions. On Kaggle you can't go too far without ensembling. I can secretly tell you that at least the top half of the leaderboard in pretty much any competition uses ensembling (stacking) in some way. Stacking is less popular in production due to time and resource constraints, but I think it is gaining popularity.
I can just do the following. Why not?
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
model_L1 = XGBRegressor()
model_L1 = model_L1.fit(X_train, y_train)
S_train = model_L1.predict(X_train).reshape(-1, 1) # <- DOES NOT work due to overfitting. Must be CV
S_test = model_L1.predict(X_test).reshape(-1, 1)
model_L2 = LinearRegression()
model_L2 = model_L2.fit(S_train, y_train)
final_prediction = model_L2.predict(S_test)
The code above will give a meaningless result. If we fit on X_train we can't just predict X_train, because our 1st level model has already seen X_train, and its prediction will be overfitted. To avoid overfitting we perform a cross-validation procedure, and in each fold we predict the out-of-fold (OOF) part of X_train. You can find more details (concept, pictures, code) in the stacking tutorial.
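The out-of-fold procedure described above can be sketched by hand with scikit-learn's KFold. This is only an illustration of the idea, not vecstack's own implementation: the synthetic data, the choice of 4 folds, and all variable names are assumptions. Averaging the per-fold test predictions is one possible way to build the test feature.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, train_test_split

# Synthetic data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
model_L1 = Ridge(random_state=0)

# OOF predictions for the train set (one column per 1st level model)
S_train = np.zeros((X_train.shape[0], 1))
# Test set predictions from the model fitted in each fold
S_test_folds = np.zeros((X_test.shape[0], kf.get_n_splits()))

for fold, (fit_idx, oof_idx) in enumerate(kf.split(X_train)):
    # Fit on the in-fold part only, so the out-of-fold rows stay unseen
    model_L1.fit(X_train[fit_idx], y_train[fit_idx])
    S_train[oof_idx, 0] = model_L1.predict(X_train[oof_idx])
    S_test_folds[:, fold] = model_L1.predict(X_test)

# Average the per-fold test predictions into a single test feature
S_test = S_test_folds.mean(axis=1).reshape(-1, 1)

# The 2nd level model is trained on OOF predictions, not on refitted ones
model_L2 = LinearRegression().fit(S_train, y_train)
final_prediction = model_L2.predict(S_test)
```

Every row of S_train is predicted by a model that did not see that row during fitting, which is what makes the 2nd level training meaningful.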
OOF is an abbreviation for out-of-fold prediction. It's also known as OOF features, stacked features, stacking features, etc. Basically it means predictions for the part of the train data that the model hasn't seen during training.
Basically it is the same thing, meaning a machine learning algorithm. Often these terms are used interchangeably.
Speaking about inner stacking mechanics, you should remember that when you have a single 1st level model there will be at least n_folds separate models trained in each CV fold on different subsets of data. See Q23 for more details.
Basically it is the same thing. Both approaches use predictions as features.
Often these terms are used interchangeably.
The difference is how we generate features (predictions) for the next level: stacking uses cross-validation, while blending uses a holdout set.
The vecstack package supports only stacking, i.e. the cross-validation approach. For a given random_state value (e.g. 42), folds (splits) will be the same across all estimators. See also Q30.
You can use for example:
scipy.optimize.minimize
scipy.optimize.differential_evolution
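As a rough sketch of how scipy.optimize.minimize can be used to find weights for a weighted average: the synthetic predictions, the MSE objective, the SLSQP method, and the constraint that weights are non-negative and sum to 1 are all assumptions chosen for illustration. In practice the predictions would be OOF predictions of your 1st level models.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
# Predictions of three hypothetical models: truth plus noise of growing size
preds = np.column_stack(
    [y_true + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 1.0)])
n_models = preds.shape[1]

def mse_of_blend(weights):
    """Mean squared error of the weighted average of model predictions."""
    return np.mean((preds @ weights - y_true) ** 2)

equal_weights = np.full(n_models, 1.0 / n_models)

result = minimize(
    mse_of_blend,
    x0=equal_weights,                 # start from a plain average
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,   # non-negative weights
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
)
weights = result.x
print("weights:", weights)
print("MSE equal:", mse_of_blend(equal_weights), "MSE tuned:", mse_of_blend(weights))
```

Tuning weights on predictions the models made for their own training rows would overfit, so OOF predictions are the natural input here.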
By default you can start from a weighted average. It is easier to apply and more likely to give a good result. Then you can try an additional level, which can potentially outperform the weighted average (but not always and not in an easy way). Experiment is your friend.
Bagging, or bootstrap aggregating, works as follows: generate subsets of the training set, train models on these subsets, and then average the predictions.
The term bagging is also often used to describe the following approaches:
So if we run stacking and just average predictions, it is bagging.
Note 1: The best architecture can be found only by experiment.
Note 2: Always remember that higher number of levels or models does NOT guarantee better result. The key to success in stacking (and ensembling in general) is diversity - low correlation between models.
It depends on many factors like type of problem, type of data, quality of models, correlation of models, expected result, etc.
Some example configurations are listed below.
L1: 2-10 models -> L2: weighted (rank) average or single model
L1: 10-50 models -> L2: 2-10 models -> L3: weighted (rank) average
L1: 100-inf models -> L2: 10-50 models -> L3: 2-10 models -> L4: weighted (rank) average
You can also find some winning stacking architectures on Kaggle blog, e.g.: 1st place in Homesite Quote Conversion.
For some example configurations see Q16.
Based on experiments and correlation (e.g. Pearson). Less correlated models give a better result. It means that we should never judge our models by accuracy only. We should also consider correlation (how different a given model is from the others). Sometimes an inaccurate but very different model can add substantial value to the resulting ensemble.
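To make the correlation argument concrete, here is a small sketch with synthetic predictions (all numbers and names are made up for illustration). Model B is almost a copy of model A, while model C is less accurate but makes independent errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y_true = rng.normal(size=n)

pred_a = y_true + rng.normal(scale=0.5, size=n)   # reasonably accurate model
pred_b = pred_a + rng.normal(scale=0.05, size=n)  # nearly a copy of model A
pred_c = y_true + rng.normal(scale=0.7, size=n)   # less accurate, independent errors

# Pearson correlation matrix between model predictions (columns are models)
preds = np.column_stack([pred_a, pred_b, pred_c])
corr = np.corrcoef(preds, rowvar=False)

def mse(pred):
    return np.mean((pred - y_true) ** 2)

mse_ab = mse((pred_a + pred_b) / 2)  # average of two highly correlated models
mse_ac = mse((pred_a + pred_c) / 2)  # average of two less correlated models

print(corr.round(3))
print("MSE A+B:", mse_ab, "MSE A+C:", mse_ac)
```

In this setup, averaging A with the weaker but different model C gives a lower error than averaging A with its near-copy B.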
Nothing is wrong. Stacking is an advanced, complicated technique. It's hard to make it work. Solution: make sure to try a weighted (rank) average first instead of an additional level with some advanced models. An average is much easier to apply and in most cases it will outperform your best model. If still no luck, then your models are probably highly correlated.
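A weighted rank average can be sketched as follows. The helper name weighted_rank_average and the toy numbers are hypothetical; the idea is to replace each model's raw predictions with their ranks, so that models with different output scales can be averaged. This is mainly useful for rank-based metrics such as AUC.

```python
import numpy as np
from scipy.stats import rankdata

def weighted_rank_average(preds, weights):
    """Blend models by averaging normalized ranks of their predictions.

    preds   : array of shape (n_samples, n_models)
    weights : one weight per model (normalized to sum to 1 inside)
    """
    preds = np.asarray(preds, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Rank each model's predictions separately, then scale ranks to (0, 1]
    ranks = np.column_stack(
        [rankdata(preds[:, j]) for j in range(preds.shape[1])])
    ranks = ranks / preds.shape[0]
    return ranks @ weights

# Two models with very different output scales
preds = np.array([[0.10, 20.0],
                  [0.40, 10.0],
                  [0.35, 30.0],
                  [0.80, 40.0]])
blended = weighted_rank_average(preds, weights=[0.5, 0.5])
print(blended)
```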
Functional API (stacking function) or Scikit-learn API (StackingTransformer)?
Quick guide:
By default start from StackingTransformer with familiar scikit-learn interface and logic
If you need low RAM consumption use the stacking function, but remember that it does not store models and does not have scikit-learn capabilities
Stacking API comparison:
Property | stacking function | StackingTransformer |
---|---|---|
Execution time | Same | Same |
RAM | Consumes the smallest possible amount of RAM. Does not store models. At any point in time only one model is alive. Logic: train model -> predict -> delete -> etc. When execution ends all RAM is released. | Consumes much more RAM. It stores all models built in each fold. This price is paid for standard scikit-learn capabilities like Pipeline and FeatureUnion. |
Access to models after training | No | Yes |
Compatibility with Pipeline and FeatureUnion | No | Yes |
Estimator implementation restrictions | Must have only fit and predict (predict_proba) methods | Must be fully scikit-learn compatible |
NaN and inf in input data | Allowed | Not allowed |
Can automatically save OOF and log in files | Yes | No |
Input dimensionality (X_train, X_test) | Arbitrary | 2-D |
How do parameters of the stacking function and StackingTransformer correspond?
stacking function | StackingTransformer |
---|---|
models=[Ridge()] | estimators=[('ridge', Ridge())] |
mode='oof_pred_bag' (alias 'A') | variant='A' |
mode='oof_pred' (alias 'B') | variant='B' |
With StackingTransformer you can easily build a stacked regressor or classifier via Pipeline by adding some regressor or classifier on top of StackingTransformer.
For multilevel stacking, put several StackingTransformer's on top of each other and then some final regressor or classifier.
Note: Stacking usually takes a long time. It's expected (inevitable) behavior.
We can compute the total number of models which will be built during the stacking procedure using the following formulas:
Variant A: n_models_total = n_estimators * n_folds
Variant B: n_models_total = n_estimators * n_folds + n_estimators
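The two formulas above can be wrapped in a small helper. This is a sketch, not part of vecstack: the function name is hypothetical, and it assumes that the first formula describes variant A and the second describes variant B (which additionally fits each estimator once on the full train set).

```python
def n_models_total(n_estimators, n_folds, variant="A"):
    """Number of models built during one level of stacking."""
    if variant == "A":
        # One model per estimator per fold
        return n_estimators * n_folds
    if variant == "B":
        # Same as A, plus one model per estimator fitted on the full train set
        return n_estimators * n_folds + n_estimators
    raise ValueError("variant must be 'A' or 'B'")

# Hypothetical setup: 3 estimators, 5 folds
print(n_models_total(3, 5, variant="A"))
print(n_models_total(3, 5, variant="B"))
```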
Let's look at an example. Say we define our stacking procedure as follows:
estimators_L1 = [('lr', LinearRegression()),
                 ('ridge', Ridge())]
stack = StackingTransformer(estimators_L1, n_folds=4)
So we have two 1st level estimators and 4 folds. It means the stacking procedure will build the following number of models:
Variant A: 8 models total. Each model is trained on 3/4 of X_train.
Variant B: 10 models total. 8 models are trained on 3/4 of X_train and 2 models on full X_train.
Compute time:
time_total = n_models_total * time_of_one_model
If training time differs between estimators, compute the time for each estimator separately (set n_estimators=1 in formulas above) and then sum up times.
Which variant works better you can find out only by experiment. The default choice is variant A, because it takes less time and there should be no significant difference in result. But of course you may also try variant B. For more details see the stacking tutorial.
Note: Remember that a higher number of folds substantially increases training time (and RAM consumption for StackingTransformer). See Q23.
Note 1: It is NOT allowed to change the train set between calls to the fit and transform methods. Due to the nature of stacking, the transformation is different for the train set and any other set. If the train set is changed after training, the stacking procedure will not be able to correctly identify it and the transformation will be wrong.
Note 2: To be correctly detected, the train set does not necessarily have to be identical (exactly the same). It must have the same shape and all values must be close (np.isclose is used for checking). So if you somehow regenerate your train set you should not worry about numerical precision.
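The detection rule from Note 2 (same shape, all values close) can be sketched as below. The helper looks_like_train_set is hypothetical and only mirrors the described rule; it is not vecstack's actual code.

```python
import numpy as np

def looks_like_train_set(X, X_train_reference):
    """Return True if X has the same shape as the reference and all values are close."""
    X = np.asarray(X)
    X_train_reference = np.asarray(X_train_reference)
    if X.shape != X_train_reference.shape:
        return False
    return bool(np.isclose(X, X_train_reference).all())

X_train = np.array([[0.1, 0.2],
                    [0.3, 0.4]])
X_regenerated = X_train + 1e-12   # tiny numerical noise
X_changed = X_train.copy()
X_changed[0, 0] = 5.0             # a genuinely different value

print(looks_like_train_set(X_regenerated, X_train))  # tiny noise is tolerated
print(looks_like_train_set(X_changed, X_train))      # changed data is not
print(looks_like_train_set(X_train[:1], X_train))    # different shape
```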
If you transform X_train and see 'Train set was detected' everything is OK. If you transform X_train but you don't see this message then something went wrong. Probably your train set was changed (it is not allowed). In this case you have to retrain StackingTransformer. For more details see stacking tutorial or Q8.
Common convention: The very first bunch of models which are trained on initial raw data are called L1. On top of L1 we have so called stacker level or meta level or L2 i.e. models which are trained on predictions of L1 models. Count continues in the same fashion up to arbitrary number of levels.
I use this convention in my code and docs. But of course your Kaggle teammates may use some other naming approach, so you should clarify this for your specific case.
Can I use (Randomized)GridSearchCV to tune the whole stacking Pipeline?
Yes, technically you can, but it is not recommended because this approach will lead to redundant computations. General practical advice is to tune each estimator separately and then use tuned estimators on the 1st level of stacking. Higher level estimators should be tuned in the same fashion using OOF from previous level. For manual tuning you can use stacking function or StackingTransformer with a single 1st level estimator.
import numpy as np
from sklearn.metrics import roc_auc_score
def auc(y_true, y_pred):
    """ROC AUC metric for both binary and multiclass classification.
    Parameters
    ----------
    y_true : 1d numpy array
        True class labels
    y_pred : 2d numpy array
        Predicted probabilities for each class
    """
    # One-hot encode labels with numpy. This avoids OneHotEncoder(sparse=False),
    # whose `sparse` argument was replaced by `sparse_output` in newer scikit-learn.
    classes, y_idx = np.unique(np.ravel(y_true), return_inverse=True)
    y_true_ohe = np.eye(len(classes))[y_idx]
    auc_score = roc_auc_score(y_true_ohe, y_pred)
    return auc_score
How does random_state work?
To ensure better result, folds (splits) have to be the same across all estimators and all stacking levels. It means that random_state has to be the same in every call to the stacking function or StackingTransformer. This is the default behavior of both the stacking function and StackingTransformer (by default random_state=0). If you want to try different folds (splits), set different random_state values.
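The effect of a fixed random_state on folds can be illustrated with scikit-learn's KFold. This is a sketch under the assumption of shuffled 4-fold splitting on toy data; the helper fold_indices is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy data: 10 samples

def fold_indices(random_state):
    """Return the out-of-fold row indices of each fold for a given seed."""
    kf = KFold(n_splits=4, shuffle=True, random_state=random_state)
    return [oof_idx.tolist() for _, oof_idx in kf.split(X)]

folds_first_call = fold_indices(0)
folds_second_call = fold_indices(0)   # same seed: identical splits
folds_other_seed = fold_indices(1)    # different seed: splits will typically differ

print(folds_first_call == folds_second_call)
print(folds_first_call)
print(folds_other_seed)
```

With the same seed, every estimator and every stacking level sees exactly the same splits, so OOF predictions from different models line up row by row.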