LAMB Optimizer for Large Batch Training (TensorFlow version)
This is a simple implementation of the LAMB optimizer, which appeared in the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes".
The older title of the paper was "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes".
Update: an official implementation of the LAMB optimizer is now available: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py (it may differ from this implementation in details such as the scaling function).
The LAMB optimizer was originally designed for large-batch training of neural networks, but as the authors indicate, it can also be used with small batch sizes.
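For reference, below is a minimal NumPy sketch of a single LAMB update step for one parameter tensor, following the update rule from the paper (bias correction is omitted and the scaling function is taken as the identity). The function and variable names are illustrative only and are not taken from this repository.

```python
import numpy as np

def lamb_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single parameter tensor (illustrative sketch)."""
    # Adam-style first and second moment estimates
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Adam update direction plus decoupled weight decay
    update = m / (np.sqrt(v) + eps) + weight_decay * param
    # Layer-wise trust ratio: ||w|| / ||update|| (scaling function = identity)
    w_norm = np.linalg.norm(param)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if (w_norm > 0 and u_norm > 0) else 1.0
    # Scale the learning rate layer-wise and apply the update
    new_param = param - lr * trust_ratio * update
    return new_param, m, v
```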
The implementation is based on the BERT repository, which uses AdamWeightDecayOptimizer (which appears in optimization.py) for pre-training and fine-tuning.
You can use LAMBOptimizer as a regular optimizer in TensorFlow, similar to Adam or AdamWeightDecayOptimizer. To use it with BERT, replace AdamWeightDecayOptimizer in optimization.py with LAMBOptimizer, and adjust the learning_rate for your task.
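As a rough illustration, the swap inside create_optimizer() in BERT's optimization.py might look like the snippet below. The constructor arguments simply mirror those of AdamWeightDecayOptimizer and are assumptions; check the actual LAMBOptimizer signature in this repository before using them.

```python
# Inside create_optimizer() in optimization.py (BERT), replace the original
# AdamWeightDecayOptimizer with LAMBOptimizer. The arguments below mirror the
# AdamWeightDecayOptimizer defaults and are assumptions, not guaranteed to
# match this repo's LAMBOptimizer signature exactly.
optimizer = LAMBOptimizer(
    learning_rate=learning_rate,  # may need re-tuning for LAMB
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
```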
Here are the results for several classical neural networks (MLP, CNN, Bi-RNN, Bi-GRU, Bi-LSTM) with different optimizers (Adam, AdamW, LAMB).
I only list results for batch sizes of {64, 128, 1024, 16384}. For full results, please see FULL_RESULTS.md.
Results with batch=64:

Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
---|---|---|---|---|---|---|
Adam | 97.03 | 98.93 | 96.24 | 98.92 | 99.04 | Just ordinary Adam |
AdamW | 97.11 | 99.01 | 96.50 | 99.11 | 99.04 | Used in BERT |
LAMB | 98.27 | 99.33 | 97.73 | 98.83 | 98.94 | New optimizer for large batch |
Results with batch=128:

Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
---|---|---|---|---|---|---|
Adam | 96.38 | 98.76 | 97.73 | 99.08 | 99.09 | Just ordinary Adam |
AdamW | 96.57 | 98.72 | 98.05 | 98.96 | 99.00 | Used in BERT |
LAMB | 97.90 | 99.20 | 98.04 | 98.87 | 98.76 | New optimizer for large batch |
Results with batch=1024:

Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
---|---|---|---|---|---|---|
Adam | 93.05 | 97.92 | 98.10 | 98.94 | 98.67 | Just ordinary Adam |
AdamW | 93.67 | 98.00 | 98.19 | 98.86 | 98.82 | Used in BERT |
LAMB | 97.68 | 98.82 | 98.27 | 98.61 | 98.47 | New optimizer for large batch |
Results with batch=16384:

Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
---|---|---|---|---|---|---|
Adam | 88.46 | 95.06 | 95.98 | 97.81 | 97.74 | Just ordinary Adam |
AdamW | 91.46 | 96.57 | 96.34 | 98.45 | 98.39 | Used in BERT |
LAMB | 93.23 | 97.89 | 93.76 | 87.60 | 80.36 | New optimizer for large batch |
Note: The following conclusions are based only on the results above.
- LAMB outperforms Adam and AdamW most of the time, and shows consistent results across different batch sizes.
- LAMB outperforms Adam and AdamW by a clear margin on large batches, showing its excellent scalability.
- LAMB fails to outperform Adam and AdamW on complex RNN-based models, regardless of batch size.

Check mnist_tensorflow.ipynb for details.
Note: GPU/TPU runs will not produce exactly the same results even with a fixed random seed.
For help or issues, please submit a GitHub issue.