Machine learning based causal inference/uplift in Python
causeinfer is a Python package for estimating average and conditional average treatment effects using machine learning. The goal is to compile both standard and advanced causal inference models, as well as to demonstrate their usage and efficacy, with the overarching ambition of helping people learn causal inference techniques across the business, medical, and socioeconomic fields. See the documentation for a full outline of the package, including the available models and datasets.
causeinfer can be downloaded from PyPI via pip or sourced directly from this repository:
pip install causeinfer
git clone https://github.com/andrewtavis/causeinfer.git
cd causeinfer
python setup.py install
import causeinfer
Separate models for treatment and control groups are trained and combined to derive average treatment effects (Hansotia, 2002).
from causeinfer.standard_algorithms.two_model import TwoModel
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
tm_pred = TwoModel(
    treatment_model=RandomForestRegressor(**kwargs),
    control_model=RandomForestRegressor(**kwargs),
)
tm_pred.fit(X=X_train, y=y_train, w=w_train)
# An array of predictions given a treatment and control model
tm_preds = tm_pred.predict(X=X_test)
tm_proba = TwoModel(
    treatment_model=RandomForestClassifier(**kwargs),
    control_model=RandomForestClassifier(**kwargs),
)
tm_proba.fit(X=X_train, y=y_train, w=w_train)
# An array of predicted treatment class probabilities given models
tm_probas = tm_proba.predict_proba(X=X_test)
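The effect arrays used later in evaluation (e.g. tm_effects) can be derived from these predictions. Below is a minimal numpy sketch, assuming predict returns one (treatment, control) prediction pair per unit; check the documentation for the exact output shape:

```python
import numpy as np

# Hypothetical output of tm_pred.predict: a (treatment, control)
# prediction pair for each test unit.
tm_preds = np.array([[0.8, 0.5], [0.3, 0.4], [0.9, 0.2]])

# The unit-level effect estimate is the treatment prediction
# minus the control prediction.
tm_effects = tm_preds[:, 0] - tm_preds[:, 1]
```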
An interaction term between treatment and covariates is added to the data to allow for a basic single model application (Lo, 2002).
from causeinfer.standard_algorithms.interaction_term import InteractionTerm
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
it_pred = InteractionTerm(model=RandomForestRegressor(**kwargs))
it_pred.fit(X=X_train, y=y_train, w=w_train)
# An array of predictions given a treatment and control interaction term
it_preds = it_pred.predict(X=X_test)
it_proba = InteractionTerm(model=RandomForestClassifier(**kwargs))
it_proba.fit(X=X_train, y=y_train, w=w_train)
# An array of predicted treatment class probabilities given interaction terms
it_probas = it_proba.predict_proba(X=X_test)
Units are categorized into two or four classes to derive treatment effects from favorable class attributes (Lai, 2006; Kane et al., 2014; Shaar et al., 2016).
# Binary Class Transformation
from causeinfer.standard_algorithms.binary_transformation import BinaryTransformation
from sklearn.ensemble import RandomForestClassifier
bt = BinaryTransformation(model=RandomForestClassifier(**kwargs), regularize=True)
bt.fit(X=X_train, y=y_train, w=w_train)
# An array of predicted probabilities (P(Favorable Class), P(Unfavorable Class))
bt_probas = bt.predict_proba(X=X_test)
# Quaternary Class Transformation
from causeinfer.standard_algorithms.quaternary_transformation import (
    QuaternaryTransformation,
)
from sklearn.ensemble import RandomForestClassifier
qt = QuaternaryTransformation(model=RandomForestClassifier(**kwargs), regularize=True)
qt.fit(X=X_train, y=y_train, w=w_train)
# An array of predicted probabilities (P(Favorable Class), P(Unfavorable Class))
qt_probas = qt.predict_proba(X=X_test)
Weighted versions of the binary class transformation approach are used to dampen the original model's inherently noisy results (Shaar et al., 2016).
# Reflective Uplift Transformation
from causeinfer.standard_algorithms.reflective import ReflectiveUplift
from sklearn.ensemble import RandomForestClassifier
ru = ReflectiveUplift(model=RandomForestClassifier(**kwargs))
ru.fit(X=X_train, y=y_train, w=w_train)
# An array of predicted probabilities (P(Favorable Class), P(Unfavorable Class))
ru_probas = ru.predict_proba(X=X_test)
# Pessimistic Uplift Transformation
from causeinfer.standard_algorithms.pessimistic import PessimisticUplift
from sklearn.ensemble import RandomForestClassifier
pu = PessimisticUplift(model=RandomForestClassifier(**kwargs))
pu.fit(X=X_train, y=y_train, w=w_train)
# An array of predicted probabilities (P(Favorable Class), P(Unfavorable Class))
pu_probas = pu.predict_proba(X=X_test)
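However the probabilities are consumed downstream, a common way to turn the (P(Favorable Class), P(Unfavorable Class)) pairs into unit-level uplift scores is the gap between the two. A sketch, assuming that column ordering:

```python
import numpy as np

# Hypothetical predict_proba output: (P(favorable), P(unfavorable)) per unit.
bt_probas = np.array([[0.70, 0.30], [0.45, 0.55], [0.90, 0.10]])

# Score each unit by how much more likely the favorable class is.
bt_effects = bt_probas[:, 0] - bt_probas[:, 1]
```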
Comparisons across stratified, ordered treatment response groups are used to derive model efficiency.
from causeinfer.evaluation import plot_cum_effect, plot_cum_gain, plot_qini
import matplotlib.pyplot as plt
import pandas as pd

visual_eval_dict = {
    "y_test": y_test,
    "w_test": w_test,
    "two_model": tm_effects,
    "interaction_term": it_effects,
    "binary_trans": bt_effects,
    "quaternary_trans": qt_effects,
}

df_visual_eval = pd.DataFrame(visual_eval_dict, columns=visual_eval_dict.keys())
model_pred_cols = [
    col for col in visual_eval_dict.keys() if col not in ["y_test", "w_test"]
]

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(20, 5))
plot_cum_effect(
    df=df_visual_eval,
    n=100,
    models=model_pred_cols,
    percent_of_pop=True,
    outcome_col="y_test",
    treatment_col="w_test",
    normalize=True,
    random_seed=42,
    axis=ax1,
    legend_metrics=True,
)
plot_qini(  # or plot_cum_gain
    df=df_visual_eval,
    n=100,
    models=model_pred_cols,
    percent_of_pop=True,
    outcome_col="y_test",
    treatment_col="w_test",
    normalize=True,
    random_seed=42,
    axis=ax2,
    legend_metrics=True,
)
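The intuition behind these curves can be illustrated without the package: units are ranked by predicted effect, and the observed treatment-vs-control response gap is accumulated over the ordered strata. A simplified numpy sketch (illustrative only, not the causeinfer implementation):

```python
import numpy as np

def cumulative_uplift(y, w, scores, n_bins=10):
    """Observed uplift accumulated over score-ordered strata.

    y: outcomes, w: treatment indicators, scores: predicted effects.
    Illustrative only -- not the causeinfer implementation.
    """
    order = np.argsort(scores)[::-1]  # highest predicted effect first
    y, w = y[order], w[order]
    gains = []
    for k in range(1, n_bins + 1):
        top = slice(0, int(len(y) * k / n_bins))
        y_k, w_k = y[top], w[top]
        treat, ctrl = y_k[w_k == 1], y_k[w_k == 0]
        rate_t = treat.mean() if treat.size else 0.0
        rate_c = ctrl.mean() if ctrl.size else 0.0
        # Incremental responses attributable to treatment in the top k bins.
        gains.append((rate_t - rate_c) * y_k.size)
    return np.array(gains)

rng = np.random.default_rng(42)
y = rng.integers(0, 2, 1000)
w = rng.integers(0, 2, 1000)
scores = rng.random(1000)
gains = cumulative_uplift(y, w, scores)
```

A well-performing model yields gains that rise quickly over the first strata; random scores, as here, hover near zero.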
Resulting metrics plots: Hillstrom Metrics, Mayo PBC Metrics, and CMF Microfinance Metrics.
Models can easily be iterated over to derive their average effects and prediction variances. See a full example across all datasets and models in examples/model_iteration, with results shown below:
| | TwoModel | InteractionTerm | BinaryTransformation | QuaternaryTransformation | ReflectiveUplift | PessimisticUplift |
|---|---|---|---|---|---|---|
| Hillstrom | -5.4762 ± 13.589*** | -5.047 ± 15.417*** | 0.5178 ± 15.7252*** | 0.7397 ± 14.7509*** | 4.4872 ± 18.5918**** | -6.0052 ± 17.936**** |
| Mayo PBC | -0.145 ± 0.29 | -0.1335 ± 0.4471 | 0.5542 ± 0.4268 | 0.5315 ± 0.4424 | -0.8774 ± 0.233 | 0.1392 ± 0.3587 |
| CMF Microfinance | 18.7289 ± 5.9138** | 17.0616 ± 6.6993** | nan | nan | nan | nan |
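The ± entries above are averages and standard deviations over repeated model fits. A toy sketch of that kind of iteration using plain scikit-learn two-model fits on simulated data (the dataset, the true effect of 0.5, and the seed loop are all illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
w = rng.integers(0, 2, size=n)
y = X[:, 0] + 0.5 * w + rng.normal(scale=0.1, size=n)  # true ATE = 0.5

avg_effects = []
for seed in range(5):
    model_t = RandomForestRegressor(n_estimators=50, random_state=seed)
    model_c = RandomForestRegressor(n_estimators=50, random_state=seed)
    model_t.fit(X[w == 1], y[w == 1])  # treatment group model
    model_c.fit(X[w == 0], y[w == 0])  # control group model
    avg_effects.append((model_t.predict(X) - model_c.predict(X)).mean())

print(f"ATE estimate: {np.mean(avg_effects):.4f} ± {np.std(avg_effects):.4f}")
```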
from causeinfer.data import hillstrom
import pandas as pd

hillstrom.download_hillstrom()
data_hillstrom = hillstrom.load_hillstrom(
    user_file_path="datasets/hillstrom.csv", format_covariates=True, normalize=True
)
df = pd.DataFrame(
    data_hillstrom["dataset_full"], columns=data_hillstrom["dataset_full_names"]
)
from causeinfer.data import mayo_pbc

mayo_pbc.download_mayo_pbc()
data_mayo_pbc = mayo_pbc.load_mayo_pbc(
    user_file_path="datasets/mayo_pbc.text", format_covariates=True, normalize=True
)
df = pd.DataFrame(
    data_mayo_pbc["dataset_full"], columns=data_mayo_pbc["dataset_full_names"]
)
from causeinfer.data import cmf_micro

data_cmf_micro = cmf_micro.load_cmf_micro(
    user_file_path="datasets/cmf_micro", format_covariates=True, normalize=True
)
df = pd.DataFrame(
    data_cmf_micro["dataset_full"], columns=data_cmf_micro["dataset_full_names"]
)
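The loaded frames can then be split into the X_train, y_train, and w_train arrays used by the models above. A sketch on simulated columns — the "response" and "treatment" column names here are hypothetical, so check the loaded dataset_full_names for the actual ones:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Simulated stand-in for a loaded dataset; "response" and "treatment"
# are hypothetical column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "response": rng.integers(0, 2, 100),
    "treatment": rng.integers(0, 2, 100),
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})

X = df.drop(columns=["response", "treatment"]).to_numpy()
y = df["response"].to_numpy()
w = df["treatment"].to_numpy()

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.2, random_state=42
)
```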
Please see the contribution guidelines if you are interested in contributing to this project. Work that is in progress or could be implemented includes:
- Adding more baseline models and datasets (see issues)
- Converting GRF files to Python and connecting them to the C++ boilerplate
- Adding a data simulator (see issue)
- Finding more causal inference datasets to be added (see issue)
- Adding a predict method to binary_transformation and quaternary_transformation
- Updating and refining the documentation
- Improving tests for greater code coverage
- Improving code quality by refactoring large functions and checking conventions