Even if it's true that Adam and AdamW behave the same way when the weight decay is set to 0, I don't think that's enough to change the default behavior (0.01 is a great default otherwise; it is the one we set in fastai for the Learner after countless experiments, but I think it should be set in a higher-level API, not in the optimizer itself).

Regularization techniques like weight decay, dropout, and early stopping can be used to address overfitting in transformers.

Weight decay decoupling effect.

Here, we fit a Gaussian Process model that tries to predict the performance of the parameters (i.e. …). We also combine this with an early stopping algorithm, Asynchronous Hyperband, where we stop badly performing trials early to avoid wasting resources on them.

These models were trained under the same conditions as the C3D (batch size: 2, Adam optimizer with a cosine annealing scheduler, learning rate: $3\times 10^{-4}$, weight decay: $3\times 10^{-5}$).

Weight decay is a form of regularization: after calculating the gradients, we also shrink the weights by multiplying them by a factor slightly below 1, e.g., 0.99.

Why AdamW matters.

Adaptive optimizers like Adam have … where $\lambda$ is a value determining the strength of the penalty (encouraging smaller weights).

$L_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate, this is not the case for adaptive gradient algorithms such as Adam.
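
The equivalence claim for SGD, and its failure for Adam, can be checked numerically on a single scalar weight. The following is only a minimal sketch in pure Python: all function names are hypothetical, the hyperparameter values are arbitrary, and the decoupled Adam variant assumes the decay is scaled by the learning rate, which is one common convention rather than the only one.

```python
import math


def sgd_l2_step(w, grad, lr, l2):
    """SGD with an L2 penalty: the term l2 * w is added to the gradient."""
    return w - lr * (grad + l2 * w)


def sgd_decoupled_step(w, grad, lr, decay):
    """SGD with decoupled weight decay: shrink w directly, then step."""
    return (1.0 - decay) * w - lr * grad


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's bias-corrected update direction and the new moments."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, grad, m, v, t, lr, l2):
    """Adam with L2: the penalty gradient passes through the adaptive scaling."""
    direction, m, v = adam_direction(grad + l2 * w, m, v, t)
    return w - lr * direction, m, v


def adamw_step(w, grad, m, v, t, lr, weight_decay):
    """Decoupled decay (assumed lr-scaled): shrink w outside the adaptive scaling."""
    direction, m, v = adam_direction(grad, m, v, t)
    return (1.0 - lr * weight_decay) * w - lr * direction, m, v


# Arbitrary illustrative values, not taken from any experiment.
w, grad, lr, lam = 1.0, 0.5, 0.1, 0.01

# SGD: the two forms coincide once the decay is rescaled by the learning rate.
sgd_a = sgd_l2_step(w, grad, lr, l2=lam)
sgd_b = sgd_decoupled_step(w, grad, lr, decay=lr * lam)

# Adam: the same lam gives different weights after a single step (t = 1).
adam_a, _, _ = adam_l2_step(w, grad, 0.0, 0.0, 1, lr, l2=lam)
adam_b, _, _ = adamw_step(w, grad, 0.0, 0.0, 1, lr, weight_decay=lam)

print("SGD  L2 vs decoupled:", sgd_a, sgd_b)
print("Adam L2 vs decoupled:", adam_a, adam_b)
```

On the first Adam step the update direction is roughly the sign of the gradient, so an L2 term folded into the gradient barely changes the step, while the decoupled form still shrinks the weight by the full decay factor. That is the gap the sketch makes visible.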