Showcase for using H2O and R for churn prediction (inspired by ZhouFang928 examples)
In the blog post Telco Customer Churn with R in SQL Server 2016, ZhouFang928 presented a great analysis of telco customer churn prediction. I found that the comparison left out one of my favorite machine-learning libraries, H2O. This showcase presents how easy it is to use the H2O library to build high-quality predictive models.
I have used:
Installation of the packages requires Rtools compatible with your R version.
Install dependencies for the project
rsuite proj depsinst
It will result in the following output
2017-09-23 20:39:18 INFO:rsuite:Detecting repositories (for R 3.3)...
2017-09-23 20:39:20 WARNING:rsuite:Project is configured to use non reliable repositories: S3. You should use only reliable repositories to be sure of project consistency over time.
2017-09-23 20:39:20 INFO:rsuite:Will look for dependencies in ...
2017-09-23 20:39:20 INFO:rsuite:. MRAN#1 = http://mran.microsoft.com/snapshot/2017-09-23 (win.binary, source)
2017-09-23 20:39:20 INFO:rsuite:. S3#2 = http://h2o-release.s3.amazonaws.com/h2o/master/4034/R (source)
2017-09-23 20:39:20 INFO:rsuite:Collecting project dependencies (for R 3.3)...
2017-09-23 20:39:20 INFO:rsuite:Resolving dependencies (for R 3.3)...
2017-09-23 20:39:44 INFO:rsuite:Detected 29 dependencies to install. Installing...
2017-09-23 20:43:47 INFO:rsuite:All dependencies successfully installed.
Build custom packages
rsuite proj build
You should get the following output
2017-09-23 20:48:46 INFO:rsuite:Installing externalpackages (for R 3.3) ...
2017-09-23 20:48:51 INFO:rsuite:Installing modelbuilder (for R 3.3) ...
2017-09-23 20:48:57 INFO:rsuite:Successfuly build 2 packages
Run model training and evaluation
Rscript.exe R\build_telco_churn_model.R --nthreads=4 --max-mem="4g"
Please note that the script takes two parameters: --nthreads and --max-mem.
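As a minimal sketch of how flags such as --nthreads and --max-mem could be read inside an R script, the snippet below uses only base R. The helper name get_flag and the default values are hypothetical and are not taken from build_telco_churn_model.R, which may well parse its arguments differently.

```r
# Hypothetical helper: reads "--name=value" flags from a character vector of
# arguments and returns the value for `name`, or `default` when it is absent.
get_flag <- function(args, name, default = NULL) {
  prefix <- paste0("--", name, "=")
  hits <- args[startsWith(args, prefix)]
  if (length(hits) == 0) {
    return(default)
  }
  # Use the last occurrence and strip optional surrounding quotes.
  value <- substring(hits[length(hits)], nchar(prefix) + 1)
  gsub("^[\"']|[\"']$", "", value)
}

# Arguments passed after the script name on the Rscript command line.
args <- commandArgs(trailingOnly = TRUE)

# Assumed defaults, used only when a flag is missing.
nthreads <- as.integer(get_flag(args, "nthreads", default = "1"))
max_mem  <- get_flag(args, "max-mem", default = "1g")

# With the h2o package loaded, these values could then be forwarded as
# h2o.init(nthreads = nthreads, max_mem_size = max_mem).
```

Values parsed this way map naturally onto the nthreads and max_mem_size arguments of h2o.init, which control how many CPU threads and how much memory the local H2O cluster may use.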
After successful model building, you can find the model (in H2O format) in the folder export. It can be loaded in H2O Flow for further inspection.
I decided to go with Gradient Boosting Models (GBM). To select the best model, I ran a grid search over the following parameters:
The best model was selected using the AUC metric, which resulted in 100 trees with a max depth of 16. After model building, I optimized the classification threshold to maximize the minimum per-class accuracy.
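To make the threshold step concrete, here is a small base R sketch of one way to pick a cutoff that maximizes the minimum per-class accuracy. The function name best_threshold, the candidate grid of cutoffs, and the toy data are all assumptions made for illustration; the project itself presumably relies on the metrics H2O reports for the model rather than on code like this.

```r
# Pick the probability cutoff that maximizes the smaller of the two per-class
# accuracies (accuracy on class 1 and accuracy on class 0).
best_threshold <- function(prob, actual,
                           candidates = seq(0.01, 0.99, by = 0.01)) {
  stopifnot(length(prob) == length(actual), all(actual %in% c(0, 1)))
  min_acc <- vapply(candidates, function(t) {
    predicted <- as.integer(prob >= t)
    acc_pos <- mean(predicted[actual == 1] == 1)  # accuracy on churners
    acc_neg <- mean(predicted[actual == 0] == 0)  # accuracy on non-churners
    min(acc_pos, acc_neg)
  }, numeric(1))
  best <- which.max(min_acc)  # first cutoff reaching the maximum
  list(threshold = candidates[best], min_per_class_accuracy = min_acc[best])
}

# Toy scores and labels (made up for illustration only).
prob   <- c(0.05, 0.20, 0.35, 0.40, 0.60, 0.65, 0.80, 0.95)
actual <- c(0,    0,    0,    1,    0,    1,    1,    1)

res <- best_threshold(prob, actual)
print(res)
```

Maximizing the minimum of the two accuracies keeps the classifier from favoring the majority class, which matters for churn data where non-churners usually dominate.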
The best model (with the threshold selected to maximize the minimum per-class accuracy) gave the following results on the test dataset:
The computation involved validating 12 GBM models with different parameters using 5-fold cross-validation. On my laptop (Intel i7, 8GB RAM, Windows 10) it took around 25 minutes. On an Amazon EC2 c4.4xlarge instance the time dropped to around 14-15 minutes.
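A grid search of this shape can be sketched with the h2o R package roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data frame is synthetic, the column names and the grid id gbm_churn_grid are hypothetical, and the hyperparameter values were chosen only so that the grid has 12 combinations evaluated with 5-fold cross-validation. Running it requires the h2o package and a Java runtime.

```r
library(h2o)

# Start a local H2O cluster; the values mirror the command-line example above.
h2o.init(nthreads = 4, max_mem_size = "4g")

# Synthetic stand-in for the telco data (made up for illustration only).
set.seed(42)
n <- 500
df <- data.frame(
  tenure  = runif(n, 0, 72),
  charges = runif(n, 20, 120)
)
score <- df$charges / 120 - df$tenure / 72 + rnorm(n, sd = 0.3)
df$churn <- factor(ifelse(score > 0, "yes", "no"))

train <- as.h2o(df)

# Hypothetical grid: 3 x 4 = 12 combinations. Only the model count matches
# the text; the values are not the author's actual parameter grid.
hyper_params <- list(
  ntrees    = c(50, 100, 200),
  max_depth = c(4, 8, 12, 16)
)

grid <- h2o.grid(
  algorithm      = "gbm",
  grid_id        = "gbm_churn_grid",
  x              = c("tenure", "charges"),
  y              = "churn",
  training_frame = train,
  nfolds         = 5,
  seed           = 42,
  hyper_params   = hyper_params
)

# Rank the models by cross-validated AUC and fetch the best one.
sorted_grid <- h2o.getGrid("gbm_churn_grid", sort_by = "auc", decreasing = TRUE)
best_model  <- h2o.getModel(sorted_grid@model_ids[[1]])
print(h2o.auc(best_model, xval = TRUE))
```

Because every combination is trained once per fold plus once on the full training frame, the run time grows with both the grid size and the number of folds, which is consistent with the timings reported above.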
print function.
Folders: