Hyperparameter optimization algorithms for use in the MLJ machine learning framework
Hyperparameter optimization for MLJ machine learning models.
See the Tuning Models section of the MLJ manual for usage examples.
Note: This component of the MLJ stack applies to MLJ versions 0.8.0 and higher. Prior to 0.8.0, tuning algorithms resided in MLJ.
This repository is not intended to be directly imported by the general MLJ user. Rather, MLJTuning is a dependency of the MLJ machine learning platform, which allows MLJ users to perform a variety of hyperparameter optimization tasks from there.
MLJTuning is the place for developers to integrate hyperparameter optimization algorithms (here called tuning strategies) into MLJ, either by adding code to /src/strategies, or by importing MLJTuning into a third-party package and implementing MLJTuning's tuning strategy interface.
MLJTuning is a component of the MLJ stack which does not have MLJModels as a dependency (no ability to search and load registered MLJ models). It does however depend on MLJBase and, in particular, on the resampling functionality currently residing there.
This repository contains:
a tuning wrapper called TunedModel for transforming arbitrary MLJ models into "self-tuning" ones - that is, into models which, when fit, automatically optimize a specified subset of the original hyperparameters (using cross-validation or another resampling strategy) before training the optimal model on all supplied data
an abstract tuning strategy interface to allow developers to conveniently implement common hyperparameter optimization strategies, such as:
search of models generated by an arbitrary iterator, e.g., models = [model1, model2, ...] (built-in Explicit strategy)
grid search (built-in Grid strategy)
Latin hypercubes (built-in LatinHypercube strategy, interfacing the LatinHypercubeSampling package)
random search (built-in RandomSearch strategy)
particle swarm optimization (MLJParticleSwarmOptimization.jl)
bandit
simulated annealing
Bayesian optimization using Gaussian processes
structured tree Parzen estimators (MLJTreeParzenTuning from TreeParzen.jl)
multi-objective (Pareto) optimization
genetic algorithms
AD-powered gradient descent methods
a selection of implementations of the tuning strategy interface, currently all those accessible from MLJ itself.
the code defining the MLJ functions learning_curves! and learning_curve, as these are essentially one-dimensional grid searches
This document assumes familiarity with the Evaluating Model Performance and Performance Measures sections of the MLJ manual. Tuning itself, from the user's perspective, is described in Tuning Models.
What follows is an overview of tuning in MLJ. After the overview is an elaboration on those terms given in italics.
All tuning in MLJ is conceptualized as an iterative procedure, each iteration corresponding to a performance evaluation of a single model. Each such model is a mutated clone of a fixed prototype. In the general case, this prototype is a composite model, i.e., a model with other models as hyperparameters, and while the type of the prototype (and hence of each mutated clone) is fixed, the types of the sub-models are allowed to vary.
When all iterations of the algorithm are complete, the optimal model is selected by applying a selection heuristic to a history generated according to the specified tuning strategy. Iterations are generally performed in batches, which are evaluated in parallel (sequential tuning strategies degenerating into semi-sequential strategies, unless the batch size is one). At the beginning of each batch, both the history and an internal state object are consulted, and, on the basis of the tuning strategy, a new batch of models to be evaluated is generated. On the basis of these evaluations, and the strategy, the history and internal state are updated.
The tuning algorithm initializes the state object before iterations begin, on the basis of the specific strategy and a user-specified range object.
Recall that in MLJ a model is an object storing the hyperparameters of some learning algorithm indicated by the name of the model type (e.g., DecisionTreeRegressor). Models do not store learned parameters.
An evaluation is an object E returned by some call to the evaluate! method, when passed the resampling strategy (e.g., resampling=CV(nfolds=9)) and a battery of user-specified performance measures (e.g., measures=[cross_entropy, accuracy]). An evaluation object E contains a list of measures E.measure and a list of corresponding measurements E.measurement, each of which is the aggregate of the measurements for each resampling of the data, which are stored in E.per_fold (a vector of vectors). In the case of a measure that reports a value for each individual observation (to obtain the per-fold measurement by aggregation), the per-observation values can be retrieved from E.per_observation. This last object includes missing entries for measures that do not report per-observation values (reports_per_observation(measure) = false), such as auc. See Evaluating Model Performance for details. There is a trait for measures called orientation, which is :loss for measures you ordinarily want to minimize, and :score for those you want to maximize. See Performance Measures for further details.
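For orientation, here is a minimal sketch of how such an evaluation object might be produced and inspected, assuming MLJ and the MLJDecisionTreeInterface package are installed (the particular model and data are incidental):

using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree
mach = machine(Tree(), X, y)

E = evaluate!(mach,
              resampling=CV(nfolds=9),
              measures=[cross_entropy, accuracy])

E.measure          # [cross_entropy, accuracy]
E.measurement      # one aggregated value per measure
E.per_fold         # vector of vectors: per-fold values for each measure
E.per_observation  # per-observation values, or missing entries where unsupported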
A tuning strategy is an instance of some subtype S <: TuningStrategy, the name S (e.g., Grid) indicating the tuning (optimization) algorithm to be applied. The fields of the tuning strategy - called tuning hyperparameters - are those tuning parameters specific to the strategy that do not refer to specific models or specific model hyperparameters. So, for example, a default resolution to be used in a grid search is a hyperparameter of Grid, but the resolution to be applied to a specific hyperparameter (such as the maximum depth of a decision tree) is not. This latter parameter would be part of the user-specified range object. A multi-objective tuning strategy is one that consults the measurements of all measures specified by the user; otherwise only the first measure is consulted, although measurements for all measures are nevertheless reported.
A selection heuristic is a rule describing how the outcomes of the model evaluations will be used to select the best (optimal) model, after all iterations of the optimizer have concluded. For example, the default NaiveSelection() heuristic simply selects the model whose evaluation E has the smallest or largest E.measurement[1] value, according to whether the metric E.measure[1] is a :loss or :score. Most heuristics are generic in the sense that they will apply no matter what tuning strategy is applied. A selection heuristic supported by a multi-objective tuning strategy must select some "best" model (e.g., a random Pareto-optimal solution).
The history is a vector of identically-keyed named tuples, one tuple per iteration. The tuple keys include:
model: for the MLJ model instance that has been evaluated
measure, measurement, per_fold: for storing the values of E.measure, E.measurement and E.per_fold, where E is the corresponding evaluation object
metadata: for any tuning strategy-specific information required to be recorded in the history but not intended to be reported to the user (for example, an implementation-specific representation of model)
There may be additional keys for tuning-specific information that is to be reported to the user (such as temperature in simulated annealing).
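For illustration only (the measurement values below are invented, and real entries may carry additional strategy-specific keys), a single history entry might look something like this, assuming MLJ and the MLJDecisionTreeInterface package are installed:

using MLJ
Tree = @load DecisionTreeClassifier pkg=DecisionTree

entry = (model = Tree(max_depth = 3),
         measure = [cross_entropy, accuracy],
         measurement = [0.42, 0.91],
         per_fold = [[0.40, 0.43, 0.44], [0.92, 0.90, 0.91]],
         metadata = 7)   # e.g., an implementation-specific index into a grid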
A range is any object whose specification completes the specification of the tuning task, after the prototype, tuning strategy, resampling strategy, performance measure(s), selection heuristic, and total iteration count are given. This definition is intentionally broad and the interface places no restriction on the allowed types of this object, although all strategies should support the one-dimensional range objects defined in MLJBase (see below). Generally, a range may be understood as the "space" of models being searched plus strategy-specific data explaining how models from that space are actually to be generated (e.g., grid resolutions or probability distributions specific to particular hyperparameters). For more on range types, see Range types below.
Recall, for context, that in MLJ tuning is implemented as a model wrapper. A model is tuned by fitting the wrapped model to data (which also trains the optimal model on all available data). This process determines the optimal model, as defined by the selection heuristic (see above). To use the optimal model one predicts using the wrapped model. For more detail, see the Tuning Models section of the MLJ manual.
In setting up a tuning task, the user constructs an instance of the TunedModel wrapper type, which has these principal fields (a construction sketch follows the list):
model: the prototype model instance mutated during tuning (the model being wrapped)
tuning: the tuning strategy, an instance of a concrete TuningStrategy subtype, such as Grid
resampling: the resampling strategy used for performance evaluations, which must be an instance of a concrete ResamplingStrategy subtype, such as Holdout or CV
measure: a measure (loss or score) or vector of measures available to the tuning algorithm, the first of which is optimized in the common case of single-objective tuning strategies
selection_heuristic: some instance of SelectionHeuristic, such as NaiveSelection() (the default)
range: as defined above - roughly, the space of models to be searched
n: the number of iterations, that is, the number of distinct model evaluations that will be added to the history, unless the tuning strategy's supply of models is exhausted first (e.g., Grid). This is not to be confused with an iteration count specific to the tuning strategy (e.g., in particle swarm optimization)
acceleration: the computational resources to be applied (e.g., CPUProcesses() for distributed computing and CPUThreads() for multi-threaded processing)
acceleration_resampling: the computational resources to be applied at the level of resampling (e.g., in cross-validation)
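The following sketch shows these fields in use, assuming MLJ and the MLJDecisionTreeInterface package are installed; the particular model, range and settings are incidental:

using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree()

r = range(tree, :max_depth, lower=1, upper=10)

tuned_tree = TunedModel(model=tree,                 # the prototype
                        tuning=RandomSearch(),      # the tuning strategy
                        resampling=CV(nfolds=6),    # the resampling strategy
                        measure=cross_entropy,      # measure to optimize
                        range=r,                    # the space to search
                        n=25,                       # total model evaluations
                        acceleration=CPUThreads())

mach = machine(tuned_tree, X, y)
fit!(mach)          # runs the search and retrains the best model on all data
predict(mach, X)    # predicts using the optimal model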
For sample implementations, see /src/strategies/.
Several functions are part of the tuning strategy API:
clean!: for validating and resetting invalid fields in the tuning strategy (optional)
setup: for initialization of state (compulsory)
extras: for declaring and formatting additional user-inspectable information going into the history
tuning_report: for declaring any other strategy-specific information to report to the user (optional)
models: for generating batches of new models and updating the state (compulsory)
default_n: to specify the total number of models to be evaluated when n is not specified by the user
supports_heuristic: a trait used to encode which selection heuristics are supported by the tuning strategy (only needed if you define a strategy-specific heuristic)
Important note on the history. The initialization and update of the history is carried out internally, i.e., is not the responsibility of the tuning strategy implementation. The history is always initialized to nothing, rather than an empty vector.
The above functions are discussed further below, after discussing types.
Each tuning algorithm must define a subtype of TuningStrategy whose fields are the hyperparameters controlling the strategy that do not directly refer to models or model hyperparameters. These would include, for example, the default resolution of a grid search, or the initial temperature in simulated annealing.
The algorithm implementation must include a keyword constructor with defaults. Here's an example:
mutable struct Grid <: TuningStrategy
    goal::Union{Nothing,Int}
    resolution::Int
    shuffle::Bool
    rng::Random.AbstractRNG
end

# Constructor with keywords
Grid(; goal=nothing, resolution=10, shuffle=true, rng=Random.GLOBAL_RNG) =
    Grid(goal, resolution, MLJBase.shuffle_and_rng(shuffle, rng)...)
Generally, new types are defined for each class of range object a tuning strategy would like to handle, and the tuning strategy functions to be implemented are dispatched on these types. It is recommended that every tuning strategy support at least these types:
one-dimensional ranges r, where r is a MLJBase.ParamRange instance
(optional) pairs of the form (r, data), where data is extra hyperparameter-specific information, such as a resolution in a grid search, or a distribution in a random search
abstract vectors whose elements are of the above form
Recall that ParamRange has two concrete subtypes, NumericRange and NominalRange, whose instances are constructed with the MLJBase extension to the range function.
Note in particular that a NominalRange has a values field, while NumericRange has the fields upper, lower, scale, unit and origin. The unit field specifies a preferred length scale, while origin specifies a preferred "central value". These default to (upper - lower)/2 and (upper + lower)/2, respectively, in the bounded case (neither upper = Inf nor lower = -Inf). The fields origin and unit are used in generating grids or fitting probability distributions to unbounded ranges.
A ParamRange object is always associated with the name of a hyperparameter (a field of the prototype in the context of tuning), which is recorded in its field attribute, a Symbol, but for composite models this might be an Expr, such as :(atom.max_depth).
Use the iterator and sampler methods to convert ranges into one-dimensional grids or for random sampling, respectively. See the Tuning Models section of the MLJ manual or the doc-strings for more on these methods and the Grid and RandomSearch implementations.
clean! method: For validating the tuning strategy
MLJTuning.clean!(tuning::MyTuningStrategy)
Some tuning strategies have mutable fields that only support a specific set of values: a particle swarm strategy, for instance, should have at least three agents for the algorithm to work. As such, it is recommended to implement the clean! method to warn the user and correct invalid tuning hyperparameters. The method should return a string message if some fields have been reset, or an empty string otherwise, and will be called internally whenever a TunedModel machine is fit!. The default fallback for clean! returns an empty string.
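As a sketch, a hypothetical strategy with a minimum number of particles might implement clean! like this (the MyParticleSwarm type is invented for illustration and reused in later sketches):

using MLJTuning

mutable struct MyParticleSwarm <: MLJTuning.TuningStrategy
    n_particles::Int
end

function MLJTuning.clean!(tuning::MyParticleSwarm)
    message = ""
    if tuning.n_particles < 3
        message *= "`n_particles` must be at least 3. Resetting to 3. "
        tuning.n_particles = 3
    end
    return message
end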
setup method: To initialize state
state = setup(tuning::MyTuningStrategy, model, range, n, verbosity)
The setup function is for initializing the state of the tuning algorithm (available to the models method). Be sure to make this object mutable if it needs to be updated by the models method. The arguments model and n are what the user has specified in their TunedModel instance; recall that model is the prototype to be cloned and mutated, and n the total number of mutations to be generated. The state is a place to record the outcomes of any necessary initialization of the tuning algorithm (performed by setup) and a place for the models method to save and read transient information that does not need to be recorded in the history.
The setup function is called once only, when a TunedModel machine is fit! the first time, and not on subsequent calls (unless force=true). (Specifically, MLJBase.fit(::TunedModel, ...) calls setup but MLJBase.update(::TunedModel, ...) does not.)
The verbosity is an integer indicating the level of logging: 0 means logging should be restricted to warnings, and -1 means completely silent.
The fallback for setup is:
setup(tuning::TuningStrategy, model, range, n, verbosity) = range
However, a tuning strategy will generally want to implement a setup method for each range type it is going to support:
MLJTuning.setup(tuning::MyTuningStrategy, model, range::RangeType1, n, verbosity) = ...
MLJTuning.setup(tuning::MyTuningStrategy, model, range::RangeType2, n, verbosity) = ...
etc.
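Continuing the illustrative MyParticleSwarm example from above, a setup method supporting a vector of one-dimensional ranges might look something like this sketch:

import MLJBase

function MLJTuning.setup(tuning::MyParticleSwarm,
                         model,
                         range::AbstractVector{<:MLJBase.ParamRange},
                         n,
                         verbosity)
    verbosity > 0 && @info "Initializing a swarm of $(tuning.n_particles) particles."
    # record whatever the `models` method will need later; here just the ranges
    # (since `models` returns an updated state, an immutable named tuple suffices):
    return (ranges = range,)
end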
extras method: For adding user-inspectable data to the history
MLJTuning.extras(tuning::MyTuningStrategy, history, state, E) -> named_tuple
This method should return any user-inspectable information to be included in a new history entry, in addition to the model, measure, measurement and per_fold data. This method must return a named tuple, human readable if possible. Each key of the returned named tuple becomes a key of the new history entry. Here E is the full evaluation object for model, and history is the current history (before adding the new entry). The fallback for extras returns an empty named tuple.
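For example, an implementation might expose the spread of the per-fold measurements for the first measure, so users can gauge evaluation noise. A sketch, continuing the illustrative MyParticleSwarm example:

import Statistics

MLJTuning.extras(tuning::MyParticleSwarm, history, state, E) =
    (per_fold_std = Statistics.std(E.per_fold[1]),)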
models method: For generating model batches to evaluate
MLJTuning.models(tuning::MyTuningStrategy, model, history, state, n_remaining, verbosity) -> vector_of_models, new_state
This is the core method of a new implementation. Given the existing history and state, it must return a vector ("batch") of new model instances vector_of_models to be evaluated, and the updated state. Any number of models may be returned (and this includes an empty vector or nothing) and the evaluations will be performed in parallel (using the mode of parallelization defined by the acceleration field of the TunedModel instance).
Important note. Parallelization means the order in which the history gets extended after models(...) returns its list of new candidates is generally not the same order in which the candidates are returned. Some implementations may therefore need to attach extra "labeling" metadata to each model, as explained below, so that the existing history can be suitably interpreted.
If more models are returned than needed (because including them would create a history whose length exceeds the user-specified number of iterations tuned_model.n) then the surplus models are saved, for use in a "warm restart" of tuning, when the user increases tuned_model.n. The remaining models are then evaluated and these evaluations are added to the history. In any warm restart, no new call to models will be made until all saved models have been evaluated and these evaluations added to the history.
If the tuning algorithm exhausts its supply of new models (because, for example, there is only a finite supply, as in a Grid search) then vector_of_models should be an empty vector or nothing. The interface has no fixed "batch-size" parameter, and the tuning algorithm is happy to receive any number of models; a surplus is handled as explained above, while a shortfall will trigger an early stop (so that the final history has length less than tuned_model.n).
If needed, extra metadata may be attached to each model returned; see below.
Sequential tuning strategies generating models non-deterministically (e.g., simulated annealing) might choose to include a batch-size hyperparameter, and arrange that models returns batches of the specified size (to be evaluated in parallel when acceleration is set appropriately). However, the evaluations and history updates do not occur until after the models call, so it may be complicated or impossible to preserve the original (strictly) sequential algorithm in that case, which should be clearly documented.
Some simple tuning strategies, such as RandomSearch, will want to return as many models as possible in one hit. To this end, the variable n_remaining is passed to the models call; this is the difference between the current length of the history and tuned_model.n.
If a tuning strategy implementation needs to record additional metadata in the history, for each model generated, then instead of model instances, vector_of_models should be a vector of tuples of the form (m, metadata), where m is a model instance and metadata the associated data. To access the metadata for the jth element of the existing history, use history[j].metadata.
tuning_report method: To add to the user-inspectable report
As with any model, fitting a TunedModel instance generates a user-accessible report. Note that the fallback report already includes additions to the history created by the extras method mentioned above. To add more strategy-specific information to the report, one overloads tuning_report.
Specifically, the report generated by fitting a TunedModel is constructed with this code:
report1 = (best_model = best_model,
           best_history_entry = best_user_history_entry,
           history = user_history,
           best_report = best_report)
report = merge(report1, tuning_report(tuning, history, state))
where:
best_model is the best model instance (as selected according to the user-specified selection heuristic)
best_user_history_entry is the corresponding entry in the history, with metadata removed
best_report is the report generated when fitting best_model on all available data
user_history is the full history with metadata entries removed
tuning_report(::MyTuningStrategy, ...) is a method the implementer may overload. It must return a named tuple, preferably human readable.
The fallback for tuning_report returns an empty named tuple.
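A minimal sketch, continuing the illustrative MyParticleSwarm example, adds a single extra key to the report:

MLJTuning.tuning_report(tuning::MyParticleSwarm, history, state) =
    (n_evaluations = length(history),)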
default_n method: For declaring the default number of iterations
MLJTuning.default_n(tuning::MyTuningStrategy, range)
The models method, which is allowed to return multiple models in its first return value vector_of_models, is called until one of the following occurs:
The length of the history matches the number of iterations specified by the user, namely tuned_model.n, where tuned_model is the user's TunedModel instance. If tuned_model.n is nothing (because the user has not specified a value) then default_n(tuning, range) is used instead.
vector_of_models is empty or nothing.
The fallback is
default_n(tuning::TuningStrategy, range) = DEFAULT_N
where DEFAULT_N is a global constant. Do using MLJTuning; MLJTuning.DEFAULT_N to check the current value.
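Continuing the illustrative MyParticleSwarm example, a strategy might tie the default to one of its own hyperparameters, for instance:

MLJTuning.default_n(tuning::MyParticleSwarm, range) = 10*tuning.n_particles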
supports_heuristic trait
If you define a selection heuristic SpecialHeuristic (see below) and that heuristic is specific to a tuning strategy TuningStrategy, then you must define
MLJTuning.supports_heuristic(::TuningStrategy, ::SpecialHeuristic) = true
A number of built-in tuning strategy implementations of the MLJTuning API can be found at /src/strategies.
The Explicit strategy is the simplest, but is also an odd case, as the range is just an iterator of MLJ models. These models need not share a common type, and model is never cloned.
For a slightly less trivial example, see the Grid search code.
Recall that a selection heuristic is a rule which decides on the "best model", given the model evaluations in the tuning history. New heuristics are introduced by defining a new struct SomeHeuristic subtyping SelectionHeuristic and implementing a method
MLJTuning.best(heuristic::SomeHeuristic, history) -> history_entry
where history_entry is the entry in the history corresponding to the model deemed "best".
Below is a simplified version of the code defining the default heuristic NaiveSelection(), which simply chooses the model with the lowest (or highest) aggregated performance estimate, based on the first measure specified by the user in their TunedModel construction (more than one measure may be specified).
struct NaiveSelection <: MLJTuning.SelectionHeuristic end

function best(heuristic::NaiveSelection, history)
    measurements = [h.measurement[1] for h in history]
    measure = first(history).measure[1]
    if orientation(measure) == :score
        measurements = -measurements
    end
    best_index = argmin(measurements)
    return history[best_index]
end
Because this selection heuristic is generic (applies to all tuning strategies) we additionally define
MLJTuning.supports_heuristic(strategy, heuristic::NaiveSelection) = true
For strategy-specific selection heuristics, see above on how to set this trait.