2. Literature Review
In this section, we review related work on adaptive learning rate optimizers. We then present the basic concepts of these methods and provide their update rules.
Ruo-Yu Sun [1] discusses the generic optimization methods used in training neural networks, such as stochastic gradient descent (SGD) and adaptive gradient methods, in recognition of their capability to improve the training of neural networks. Adaptive learning rate methods such as Adam (Adaptive Moment Estimation), RMSProp (Root Mean Square Propagation), and AdaGrad (Adaptive Gradient) adjust the learning rate based on historical gradients, enabling more effective navigation of the optimization landscape. He further notes that there is still a gap between the practical performance and the theoretical understanding of these methods, emphasizing that “bringing theory closer to practice is still a huge challenge for both theoretical and empirical researchers”. In a nutshell, adaptive learning rate methods are essential for mitigating issues related to gradient explosion/vanishing and for improving the convergence rate.
Moreover, Diederik P. Kingma and Jimmy Lei Ba [2] introduce Adam (Adaptive Moment Estimation), an adaptive learning rate algorithm for first-order gradient-based optimization of stochastic objective functions. It is based on adaptive estimates of lower-order moments and effectively combines the strengths of AdaGrad and RMSProp, which makes it well suited for sparse gradients and non-stationary objectives. Their empirical results demonstrate that Adam outperforms traditional optimization methods such as SGD, as well as other adaptive methods, in terms of convergence rate and robustness across different machine learning tasks (see Figure 2). Its advantages include low memory requirements and an intuitive interpretation of its hyperparameters, which typically require little tuning. However, it is noted that Adam can exhibit instability if the learning rate is not properly adjusted.
As noted earlier, optimization is crucial for deep learning training. The traditional optimizer has been SGD coupled with momentum because of its simplicity, low computational cost, and good generalizability. However, it requires fine-tuning of a constant learning rate, which can cause slow convergence or suboptimal solutions in some cases. Adaptive learning rate optimizers such as Adam, RMSProp, and AdaGrad adjust the learning rate based on the gradient history, which often leads to faster convergence, especially on large datasets or in tasks with sparse gradients. We first discuss the gradient descent (GD) method briefly for a smooth transition, since these adaptive learning rate optimizers build on it.
Gradient Descent (GD): GD is an iterative method that starts from an arbitrary point on the loss function and moves down its slope in steps until it reaches the minimum of the loss function. Its update rule is given as

$$\theta_{t+1} = \theta_t - \eta\, \nabla L(\theta_t), \tag{3}$$

where $\eta$ is the learning rate and $\nabla L(\theta_t)$ is the corresponding gradient of the loss $L$ at the current parameters $\theta_t$. This method converges at a linear rate, and its solution is globally optimal for a convex objective function, while its downside is the high computational cost, as it uses the whole dataset for each gradient computation.
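For concreteness, the following minimal NumPy sketch applies the GD rule above; the quadratic test function, step size, and iteration count are our own illustrative choices, not taken from the cited works.

```python
import numpy as np

def gd(grad, theta0, eta=0.1, steps=100):
    """Plain gradient descent: theta <- theta - eta * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Example: minimize L(theta) = ||theta||^2 / 2, whose gradient is theta itself;
# the iterates approach the global minimum at the origin.
print(gd(lambda th: th, theta0=[3.0, -2.0]))
```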
Stochastic Gradient Descent (SGD): SGD was introduced to address the high computational cost of GD on large-scale data, since GD passes over the whole dataset in each iteration; instead, SGD randomly samples one data point or a mini-batch from the training dataset for each update. Let us rewrite equ. (3) as

$$\theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{b=1}^{B} \nabla L_b(\theta_t),$$

where $L_b$ represents the sum of the training loss over the $b$-th mini-batch of training samples and $B$ is the total number of mini-batches. Its update rule is given as

$$\theta_{t+1} = \theta_t - \eta\, \nabla L_i(\theta_t),$$

where the mini-batch index $i$ is drawn at random at each iteration. This method converges at a sublinear rate, which saves computational cost, while its downside is that the solution may become stuck at a saddle point in some cases.
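The update can be sketched as follows; the mean-estimation objective, batch size, and learning rate are illustrative assumptions of ours.

```python
import numpy as np

def sgd(grad_batch, theta0, data, eta=0.05, batch_size=32, epochs=5, seed=0):
    """Mini-batch SGD: each step uses the gradient of one randomly drawn batch."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            theta = theta - eta * grad_batch(theta, batch)
    return theta

# Example: estimate the mean of noisy samples by minimizing
# L(theta) = mean((theta - x)^2) / 2, whose batch gradient is mean(theta - x).
data = np.random.default_rng(1).normal(loc=4.0, scale=1.0, size=1000)
print(sgd(lambda th, b: np.mean(th - b), theta0=0.0, data=data))  # close to 4.0
```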
SGD with Momentum: Shiliang Sun et al. [3] note that the concept of momentum is derived from the mechanics of physics, where it simulates the inertia of objects; the idea of applying momentum to SGD is to preserve, to a certain degree, the influence of the previous update direction on the next iteration. When SGD randomly picks mini-batch $i$ at the $t$-th iteration, its update rule is given as

$$v_t = \gamma v_{t-1} + \eta\, \nabla L_i(\theta_t),$$
$$\theta_{t+1} = \theta_t - v_t,$$

where $\gamma \in [0, 1)$ is the momentum coefficient and $v_t$ is the accumulated update direction (velocity).
Diederik P. Kingma and Jimmy Lei Ba [2] call such methods the stochastic versions of the heavy-ball method and the accelerated gradient method, which are also known as “momentum methods” in deep learning.
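A minimal sketch of the rule above is given below; the ill-conditioned quadratic and the values of $\eta$ and $\gamma$ are our own illustrative choices.

```python
import numpy as np

def sgd_momentum(grad, theta0, eta=0.01, gamma=0.9, steps=200):
    """Heavy-ball momentum: the velocity v carries over part of the previous update."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad(theta)   # preserve the previous update direction
        theta = theta - v
    return theta

# Example on an ill-conditioned quadratic (per-coordinate curvatures 10 and 1),
# where momentum damps the zig-zagging of plain (S)GD along the steep axis.
print(sgd_momentum(lambda th: np.array([10.0, 1.0]) * th, theta0=[2.0, 2.0]))
```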
Adaptive Gradient Methods (AdaGrad): AdaGrad is an improvement on SGD [4]; it modifies the learning rate dynamically using the historical gradients from previous steps. Its update rule is given as

$$G_t = G_{t-1} + \nabla L_i(\theta_t) \circ \nabla L_i(\theta_t),$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \circ \nabla L_i(\theta_t),$$

where $\circ$ represents the element-wise product, $G_t$ accumulates the squares of all past gradients, and $\epsilon$ is a small constant that prevents division by zero. The method is suitable for dealing with sparse gradient problems but is not suitable for dealing with non-convex problems [3].
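A sketch of this rule follows; the test function and hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def adagrad(grad, theta0, eta=0.5, eps=1e-8, steps=300):
    """AdaGrad: per-coordinate steps shrink as squared gradients accumulate in G."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                   # running sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        G = G + g * g                          # element-wise product g ∘ g
        theta = theta - eta / np.sqrt(G + eps) * g
    return theta

# Coordinates that see rare (sparse) gradients accumulate little in G and thus
# keep a comparatively large effective step size.
print(adagrad(lambda th: np.array([10.0, 1.0]) * th, theta0=[2.0, 2.0]))
```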
Root Mean Square Propagation (RMSProp): RMSProp was introduced in order to correct the drawbacks of AdaGrad and Rprop (Resilient Propagation) []. This leads to a new definition of $G_t$ as an exponentially decaying average of the squared gradients rather than a cumulative sum. So, at the $t$-th iteration of RMSProp, a mini-batch $i$ is randomly selected and the update rule is given as

$$G_t = \beta G_{t-1} + (1-\beta)\, \nabla L_i(\theta_t) \circ \nabla L_i(\theta_t),$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \circ \nabla L_i(\theta_t),$$

where $\beta$ is the decay rate. This method corrects the late-stage drawback of AdaGrad, whose effective learning rate shrinks monotonically, and is suitable for non-stationary and non-convex optimization problems. Its downside is that the update process may keep oscillating around a local minimum in the late stage.
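A sketch follows; the decay rate and the remaining settings are illustrative assumptions. Note how, with a fixed $\eta$, the late-stage per-coordinate step stays near $\eta$, which is one way the oscillation issue mentioned above shows up.

```python
import numpy as np

def rmsprop(grad, theta0, eta=0.01, beta=0.9, eps=1e-8, steps=500):
    """RMSProp: G is an exponential moving average of squared gradients, so old
    gradients decay instead of accumulating forever as in AdaGrad."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        G = beta * G + (1.0 - beta) * g * g
        theta = theta - eta / np.sqrt(G + eps) * g
    return theta

# The normalization keeps each step's magnitude near eta, so the iterates end
# up hovering around the minimum at a scale set by eta.
print(rmsprop(lambda th: np.array([10.0, 1.0]) * th, theta0=[2.0, 2.0]))
```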
Adaptive Moment Estimation (Adam): Adam [2] combines RMSProp and the momentum method. It makes use of estimates of both the first and second moments of the gradient to dynamically adapt the step size for each parameter. Its update rule is given as

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \nabla L_i(\theta_t),$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, \nabla L_i(\theta_t) \circ \nabla L_i(\theta_t),$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t},$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t,$$

where $\beta_1$ and $\beta_2$ are exponential decay rates, and $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moment estimates. The Adam method is relatively stable during training and suitable for most non-convex problems with high-dimensional datasets.
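A sketch of these updates follows, using the default decay rates from [2] ($\beta_1 = 0.9$, $\beta_2 = 0.999$); the test function and learning rate are our own illustrative choices.

```python
import numpy as np

def adam(grad, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: bias-corrected first (m) and second (v) moment estimates of the gradient."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # correct the bias from zero init
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(adam(lambda th: np.array([10.0, 1.0]) * th, theta0=[2.0, 2.0]))  # -> near 0
```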