ADAS is short for Adaptive Step Size. It is an optimizer that, unlike other optimizers that merely normalize the derivatives, fine-tunes the step size itself, truly making step size scheduling obsolete and achieving state-of-the-art training performance.
Not to be confused with https://github.com/mahdihosseini/AdaS (https://openreview.net/forum?id=qUzxZj13RWY), etc.
To get the code, clone the repository: git clone https://github.com/YanaiEliyahu/AdasOptimizer.git && cd AdasOptimizer
Optimizer files: adasopt.py, and struct layer in adasopt-cpp/main.cpp.
Tips for getting the best results, or for when something isn't right: tune beta_3 so that 1 / (1 - beta_3) is roughly proportional to the number of optimization steps.
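A minimal sketch of that tip, assuming the total number of optimization steps is known in advance (the name total_steps is illustrative, not part of the library):

```python
# Pick beta_3 so that 1 / (1 - beta_3) is on the order of the number of optimization steps.
total_steps = 50_000                 # e.g. epochs * batches_per_epoch
beta_3 = 1.0 - 1.0 / total_steps     # then 1 / (1 - beta_3) == total_steps
```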
This is a graph of ADAS's (blue) and ADAM's (orange) inaccuracy percentages, in log scale (y-axis), over epochs (x-axis) on MNIST's training dataset, using a shallow network of 256 hidden nodes. While ADAM slows down significantly over time, ADAS converges to 0% inaccuracy (i.e. 100% accuracy) in 11 iterations.
The same as the training-performance graph above, but this one shows performance on the test set of CIFAR-100, using a vanilla MobileNetV2 with dropout 0.15 on the top layer. The average accuracy over the last 50 epochs is 26.4% for Adam and 37.4% for ADAS, with variances of 0.00082 and 8.88e-6 respectively. Conclusions: ADAS closes (37.4% - 26.4%) / (100% - 26.4%) = 14.9% of the gap between Adam's accuracy and 100%, and does so with a far smaller variance.
This section explains how ADAS optimizes step size.
The problem of finding the optimal step size can be formulated as optimizing f(x + f'(x) * step-size) with respect to step-size. Taking a gradient step on step-size (via the chain rule) translates into this formula: step-size(n+1) = step-size(n) + f'(x) * f'(x + f'(x) * step-size(n)).
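A toy 1-D illustration of that update rule (a sketch only; the function f and the constant step_size_lr, the step size's own step size, are assumptions made for the example):

```python
def f(x):                      # example objective, ascended by the formulas above
    return -(x - 3.0) ** 2

def df(x):                     # its derivative
    return -2.0 * (x - 3.0)

x, step_size, step_size_lr = 0.0, 0.1, 0.01
for _ in range(100):
    g = df(x)
    # chain rule: d/d(step_size) f(x + g * step_size) = g * f'(x + g * step_size)
    step_size += step_size_lr * g * df(x + g * step_size)
    x += g * step_size         # move x with the freshly adapted step size
print(x, step_size)            # x approaches the optimum at 3.0
```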
Computing the above formula requires evaluating the gradient on the entire dataset twice for each update of step-size, which is computationally expensive; to overcome this problem, replace f'(x) with an exponential moving average of x's derivative, and f'(x + f'(x) * step-size(n)) with x's current derivative, and then update step-size according to that formula in an SGD context.
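A minimal sketch of that SGD-context update, assuming a scalar step size and made-up names (grad_fn, step_size_lr, and beta are not the library's API):

```python
import numpy as np

def sgd_with_adaptive_step_size(grad_fn, x, steps=1000,
                                step_size=1e-3, step_size_lr=5e-4, beta=0.999):
    grad_ema = np.zeros_like(x)                    # exponential moving average of past gradients
    for _ in range(steps):
        g = grad_fn(x)                             # current (mini-batch) gradient
        # step-size update: the gradient EMA stands in for f'(x),
        # the current gradient stands in for f'(x + f'(x) * step-size)
        step_size += step_size_lr * float(np.dot(grad_ema, g))
        grad_ema = beta * grad_ema + (1 - beta) * g
        x = x - step_size * g                      # plain SGD step with the adapted step size
    return x, step_size
```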
For each layer in the network, ADAS keeps:
- the input step sizes, initialized to 0.1 / input-nodes-count;
- a running average (0.99999) of the layer's weight updates;
- the input offsets (0.99999), a running average of the input;
- SSSS (0.9999), a running average of the input step sizes' updates.
The layer's weights, the input step sizes, and SSSS are all updated, each taking the next item in that chain as its step size: the weights use the input step sizes, the input step sizes use SSSS, and SSSS uses a constant step size, 0.0005 by default.
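A sketch of that per-layer state (hypothetical names and shapes, not the library's code; it assumes one step size per input node):

```python
import numpy as np

class AdasLayerState:
    def __init__(self, n_in, n_out):
        self.weights = np.zeros((n_in, n_out))
        self.input_step_sizes = np.full(n_in, 0.1 / n_in)  # initialized to 0.1 / input-nodes-count
        self.weight_update_avg = np.zeros((n_in, n_out))   # running average (0.99999) of weight updates
        self.input_offsets = np.zeros(n_in)                # running average (0.99999) of the input
        self.ssss = np.zeros(n_in)                         # running average (0.9999) of input-step-size updates
        self.ssss_step_size = 0.0005                       # constant step size used to update SSSS
```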
For each update, the derivatives are normalized as in ADAM's algorithm, just without the momentum part.
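A short sketch of that normalization (second-moment scaling only, no momentum; the names and epsilon value are assumptions):

```python
import numpy as np

def normalize_gradient(grad, second_moment_avg, beta2=0.999, eps=1e-8):
    # track a running average of the squared gradient and rescale by its square root,
    # i.e. Adam's denominator without Adam's momentum numerator
    second_moment_avg = beta2 * second_moment_avg + (1 - beta2) * grad ** 2
    return grad / (np.sqrt(second_moment_avg) + eps), second_moment_avg
```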
Each update to a step size is applied multiplicatively: x(n+1) = x(n) + x(n) * u * step-size, where x is the step size being updated, u is the update computed for it, and step-size is that step size's own step size.
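In code, that rule is just a relative change of the step size (a sketch; the argument names are illustrative):

```python
def update_step_size(x, u, step_size_of_step_size):
    # x(n+1) = x(n) + x(n) * u * step-size: the step size changes by a fraction of its current value
    return x + x * u * step_size_of_step_size
```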
Backpropagation: when computing the weight updates, (input - input_offsets) is used in place of the raw input. This is reminiscent of batch normalization, and it improves training performance because it mimics second-order optimization.
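A sketch of that centered weight gradient for a dense layer (the shapes and names are assumptions, not the library's code):

```python
import numpy as np

def dense_weight_grad(inp, input_offsets, grad_output):
    # inp: (batch, n_in), input_offsets: (n_in,), grad_output: (batch, n_out)
    centered = inp - input_offsets      # subtract the running average of the input
    return centered.T @ grad_output     # (n_in, n_out) weight gradient
```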
If you are having a hard time understanding the description above, try reading adasopt.py.