R package for automation of machine learning, forecasting, model evaluation, and model interpretation
I moved the feature engineering functions to a new package called Rodeo, which I have pinned within my repositories
Rodeo stands for R Optimized Data Engineering Operations.
Release v0.6.0 is now available.
New function: ModelInsightsReport()
I have a new function available for generating an exhaustive model insights report for regression, classification, and multiclass models.
The report sections are broken out into the following groups:
Model Evaluation Metrics
Model Evaluation Plots
Model Interpretation Plots
Model MetaData
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
  Correlation = 0.85,
  N = 10000,
  ID = 2,
  ZIP = 0,
  FactorCount = 2,
  AddDate = FALSE,
  Classification = FALSE,
  MultiClass = FALSE)

# Copy data
data1 <- data.table::copy(data)

# GPU or CPU
TaskType <- 'GPU'

# Target Variable
Target <- 'Adrian'

# ID Column Names
IDcols <- c('IDcol_1', 'IDcol_2')

# Feature Column Names
Features <- names(data1)[!names(data1) %in% c(Target, IDcols)]

# Run function
RemixOutput <- RemixAutoML::AutoCatBoostRegression(

  # Metadata args
  OutputSelection = c('Importances', 'EvalPlots', 'EvalMetrics', 'Score_TrainData'),
  ModelID = 'Test_Model_1',
  task_type = TaskType,
  NumGPUs = 1,
  NumOfParDepPlots = length(Features),

  # Data args
  data = data1,
  TargetColumnName = Target,
  FeatureColNames = Features,
  IDcols = IDcols,
  TransformNumericColumns = Target,
  Methods = c('Asinh', 'Asin', 'Log', 'LogPlus1', 'Sqrt'),

  # ML args
  Trees = 200,
  Depth = 4)

# Build the html report (you'll have to open it from the directory you saved it to)
RemixAutoML::ModelInsightsReport(

  # Meta info
  TargetColumnName = Target,
  PredictionColumnName = 'Predict',
  FeatureColumnNames = Features,
  DateColumnName = NULL,

  # Control options
  RemixOutput = RemixOutput,
  TargetType = 'regression',
  ModelID = 'Test_Model_1',
  Algo = 'catboost',
  SourcePath = getwd(),
  OutputPath = getwd())
New function: AutoLagRollMode()
This release includes the new function AutoLagRollMode(), which generates lags and rolling modes for categorical variables.
# NO GROUPING CASE: Create fake Panel Data ----
Count <- 1L
for(Level in LETTERS) {
  datatemp <- RemixAutoML::FakeDataGenerator(
    Correlation = 0.75,
    N = 25000L,
    ID = 0L,
    ZIP = 0L,
    FactorCount = 2L,
    AddDate = TRUE,
    Classification = FALSE,
    MultiClass = FALSE)
  datatemp[, Factor_1 := eval(Level)]
  if(Count == 1L) {
    data <- data.table::copy(datatemp)
  } else {
    data <- data.table::rbindlist(
      list(data, data.table::copy(datatemp)))
  }
  Count <- Count + 1L
}

# NO GROUPING CASE: Create rolling modes for categorical features
data <- RemixAutoML::AutoLagRollMode(
  data,
  Lags = seq(1, 5, 1),
  ModePeriods = seq(2, 5, 1),
  Targets = c("Factor_1"),
  GroupingVars = NULL,
  SortDateName = "DateTime",
  WindowingLag = 1,
  Type = "Lag",
  SimpleImpute = TRUE)
# GROUPING CASE: Create fake Panel Data ----
Count <- 1L
for(Level in LETTERS) {
  datatemp <- RemixAutoML::FakeDataGenerator(
    Correlation = 0.75,
    N = 25000L,
    ID = 0L,
    ZIP = 0L,
    FactorCount = 2L,
    AddDate = TRUE,
    Classification = FALSE,
    MultiClass = FALSE)
  datatemp[, Factor_1 := eval(Level)]
  if(Count == 1L) {
    data <- data.table::copy(datatemp)
  } else {
    data <- data.table::rbindlist(
      list(data, data.table::copy(datatemp)))
  }
  Count <- Count + 1L
}

# GROUPING CASE: Create rolling modes for categorical features
data <- RemixAutoML::AutoLagRollMode(
  data,
  Lags = seq(1, 5, 1),
  ModePeriods = seq(2, 5, 1),
  Targets = c("Factor_1"),
  GroupingVars = "Factor_2",
  SortDateName = "DateTime",
  WindowingLag = 1,
  Type = "Lag",
  SimpleImpute = TRUE)
This release removes all dependencies that aren't strictly necessary to install RemixAutoML. The required dependencies are listed on the README. Other packages will need to be installed depending on which functions you use; with this new approach, you won't have to install everything just to use a few functions. One of the main benefits is that creating Docker images will be much easier and more lightweight for your use case.
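A minimal sketch of the on-demand dependency pattern this enables ('catboost' is used here purely as an illustrative example of a function-specific dependency):

```r
# Check for an optional dependency before using functionality that needs it
HasCatboost <- requireNamespace("catboost", quietly = TRUE)
if (!HasCatboost) {
  message("Install catboost to use the CatBoost-based functions")
}
```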
This release adds the funnel forecasting functions. Code examples can be found on the README under the Funnel Forecasting section and in the help files for the respective functions.
Background Funnel forecasting is the process of forecasting the periods out from cohort start dates and across calendar time. The functions in RemixAutoML allow you to forecast these types of processes for single series and grouped series, utilizing CatBoost, LightGBM, and XGBoost. Typically these forecasting projects center around the sales funnel, but they can be applied to any cohort-style data structure. There are two primary reasons to utilize the functions in this package over the alternatives out there. First, they utilize machine learning algorithms, whereas the alternative methods use GLMs at best and simple averaging more commonly. Second, this kind of data structure offers tons of feature engineering opportunities that other methods ignore altogether.
Feature engineering The feature engineering that goes into these functions includes calendar and cohort date features (e.g. day of week, week of month, month of year), holiday features for both calendar and cohort dates, and time series features (lags and rolling stats) that cover both calendar and cohort dates. The lags and rolling stats across cohort dates are what make these functions really unique. In the Panel CARMA functions in RemixAutoML, lags and rolling stats are generated across calendar time; here, I also take advantage of cohort time. There are automatic categorical encoding methods for LightGBM and XGBoost; CatBoost handles categorical variables internally. Automatic transformations can also be utilized, and the functions manage the conversion and back-transform for you. XREGS (exogenous variables) are also permitted; they must be attached to the base funnel data and must span the entire forecast horizon.
Data structure Typical data sets begin with some sort of base funnel measure, such as leads. The conversion measures of interest typically include sales or intermediate steps between leads and sales. Internally, the functions predict the conversion rates across cohort time and calendar time. Once all periods are forecasted, the conversion measure is also computed. Model insights are saved to file so you can inspect the factors driving the cohort process and the model performance measures.
The functions expect data with columns such as 'CalendarDateColumn', 'CohortDateColumn', 'CohortPeriodsOut', 'Leads', and 'Appointments'. If you have grouping variables, they would also be columns. The data should be in long format: for every 'CalendarDateColumn' value there will be many corresponding 'CohortDateColumn' values, since for each cohort there are many periods out where conversion measures are generated. The CohortPeriodsVariable values represent the number of numeric units from the cohort date base value. Example: if a single cohort has the calendar date '2020-01-01' and the corresponding cohort date is '2020-01-10', then the CohortPeriodsVariable will have a value of 10 (numeric or integer).
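To make the long format concrete, here is a tiny hand-built illustration of the shape described above (the values are made up; only the structure matters):

```r
# One calendar cohort ('2020-01-01') with several periods out, in long format
CalendarDate <- as.Date("2020-01-01")
CohortDates  <- CalendarDate + 0:3
funnel <- data.frame(
  CalendarDateColumn = CalendarDate,
  CohortDateColumn   = CohortDates,
  CohortPeriodsOut   = as.integer(CohortDates - CalendarDate),
  Leads              = 100L,                 # base funnel measure for the cohort
  Appointments       = c(12L, 8L, 5L, 2L))  # conversion measure by period out
funnel
```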
API For this forecasting use case, I split the training and forecasting processes into two separate functions for each ML method: Auto__FunnelCARMA() (model training) and Auto__FunnelCARMAScoring() (forecasting) are the two to be aware of.
ML parameters Similar to the other ML functions, most ML args are exposed so you can tune them in a ton of ways. You can also run them with a GPU if you've installed the GPU versions of the packages (relevant for XGBoost and LightGBM).
Usage for business There are several additional benefits of forecasting with the Funnel models versus converting the data to standard panel data structures. Business groups are often interested in individual cohorts, and they utilize that information not only for planning but also to adjust strategies and identify issues with existing ones. Anomaly detection can also be conducted by comparing forecasts to actuals as new data becomes available, which is another way to help the business get ahead of issues before they become significant.
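As a toy illustration of the anomaly detection idea, one could flag periods where actuals deviate from forecasts by more than some threshold (the 10% cutoff here is an arbitrary assumption for illustration):

```r
# Compare actuals against forecasts and flag large deviations
Actuals   <- c(100, 103, 98, 180, 101)
Forecasts <- c(101, 100, 99, 102, 100)
PctError  <- abs(Actuals - Forecasts) / Forecasts
which(PctError > 0.10)  # the spike at position 4 gets flagged
```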
LightGBM has been added to the RemixAutoML supervised learning model suite.
I've exposed almost all of the parameters available for lightgbm. For grid tuning, I've exposed the following parameters: num_iterations (trees), eta (learning rate), max_depth, num_leaves, min_data_in_leaf, bagging_freq, bagging_fraction, feature_fraction, feature_fraction_bynode, lambda_l1, lambda_l2. Similar to the other methods, trees, depth, and learning rate are the parameters used to create the buckets for grid tuning, while the others are randomly dispersed within those buckets. I force a limit of 5 parameter values for trees, depth, and learning rate, while the others are limited to 3. If you supply more, they are truncated to the first N values.
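A minimal sketch of the truncation behavior described above (the helper name is mine, not the package's):

```r
# Keep only the first N supplied values for a grid-tuning parameter
TruncateGridValues <- function(x, limit) head(x, limit)
TreesGrid <- c(100, 200, 300, 500, 800, 1000, 1500)  # 7 values supplied
TruncateGridValues(TreesGrid, 5L)  # trees/depth/learning rate are capped at 5 values
```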
As always, check out the README or the help files for code examples.
A few new functions and lots of upgrades for the supervised learning functions:
For the CatBoost and XGBoost set of models, the model insights output has expanded quite a bit.
One of the greatest uses of creating this table on training data is the ability to investigate variable importance by different slices of your data, whether using grouping variables or time slices (or both). For classification and multiclass models, you can investigate variable importance by target level (0 vs 1, A vs B vs C, etc.). Use the AbsShapValues for magnitude importance or the ShapValues for directional importance.
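For example, slicing mean absolute Shap values by a grouping variable might look like this (the column names are illustrative assumptions, not the package's exact output schema):

```r
# Toy scored-data table with per-row Shap values for one feature
ScoredData <- data.frame(
  GroupVar      = rep(c("A", "B"), each = 4),
  Shap_Feature1 = c(0.20, -0.10, 0.40, -0.30, 0.05, 0.02, -0.01, 0.03))
ScoredData$AbsShap_Feature1 <- abs(ScoredData$Shap_Feature1)

# Magnitude importance of Feature1 by slice
aggregate(AbsShap_Feature1 ~ GroupVar, data = ScoredData, FUN = mean)
```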
Note: to utilize the Shap functions, you'll need either a CatBoost or XGBoost model. The H2O models return output from the h2o.explain() function.
CatBoost Variable Importance: a list is now returned containing variable importance for the training, validation, and test data sets.
CatBoost Interaction Importance: a list is now returned containing interaction importance for the training, validation, and test data sets.
All Models: Training data can now be returned with predictions (useful for generating insights)
CatBoost + XGBoost Models: Model insights plots have been combined into a single output item called PlotList. PlotList contains evaluation plots, evaluation box plots, partial dependence calibration plots, partial dependence calibration box plots, ROC plots, gains plots, and lift plots, built on both the test data set (what was originally produced) and the combined training + validation data sets.
All Models: MultiClass models - for the individual target levels, binary evaluation metrics and all the same evaluation plots returned with the Classification models are generated for both the test set and the combined train + validation sets.
The XGBoost multiclass predictions now match what CatBoost produces: the class predictions along with the class probabilities, plus all the same expanded model insights that come with that.
All Models: For CatBoost, XGBoost, and H2O models, there is a new argument called OutputSelection. Supply any of "Importances", "EvalPlots", "EvalMetrics", "PDFs", "Score_TrainData" to tell the function which output to generate. If you aren't interested in the plot output, for example, leave out the "EvalPlots" element and the plots won't be generated.
For the Classification models, Gains and Lift plots have been added to the output, and the function that creates them has been exported: CumGainsChart().
ParDepCalPlots() has also been refactored to run faster.