1. Introduction
During the training of a neural network, the primary operation in each neuron is a linear transformation followed by an activation function. The activation function plays an essential role in training deep networks [1, 2].
Listed below are some of the most prevalent types of activation functions for neural networks:
1.1. Binary Step
A threshold value determines whether a neuron becomes active. The input to the activation function is compared with the threshold, and the neuron becomes active only if the input is greater than the threshold. If the input is lower than the threshold, the neuron’s output is not passed on to the next hidden layer. Equation 1 expresses this function:
However, the binary step function has two limitations. First, it cannot generate outputs with multiple values, so we cannot use it to solve classification problems involving multiple classes. Second, the gradient of the step function is zero, which impedes the backpropagation procedure.
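As a minimal sketch of the behavior described above, the binary step and its gradient could be written as follows. The function names and the default threshold of 0 are assumptions made for illustration; the text does not fix a particular threshold.

```python
def binary_step(x: float, threshold: float = 0.0) -> float:
    """Return 1.0 if x is greater than the threshold, else 0.0.

    The default threshold of 0 is an assumption for this sketch.
    """
    return 1.0 if x > threshold else 0.0


def binary_step_grad(x: float) -> float:
    """Gradient of the step function.

    The function is flat on both sides of the threshold, so the gradient is 0.
    It is undefined exactly at the threshold; 0.0 is returned there by convention.
    """
    return 0.0
```

Because `binary_step_grad` returns 0 for every input, no error signal can flow back through such a neuron, which is the backpropagation problem noted above.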
1.2. Linear
The output of the linear activation function is proportional to the input. This activation function is also known as the "no activation function" or the "identity function." It returns its input value, so the weighted sum of the neuron’s inputs passes through unmodified. This function is expressed mathematically as:
Nevertheless, a linear activation function has two significant drawbacks. First, the function’s derivative is a constant that does not depend on the input x, so the gradients used in backpropagation carry no information about the input. Second, if every neuron uses a linear activation function, all layers of the network collapse into one: the output of the final layer is always a linear function of the input to the first layer, regardless of how many layers the network has. A network in which every neuron uses a linear activation function can therefore be reduced to a single layer.
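The collapse of stacked linear layers can be checked with a small sketch. The one-weight "layers" and the specific weights and biases below are hypothetical values chosen only for illustration.

```python
def linear_layer(x: float, w: float, b: float) -> float:
    """A single neuron with the identity activation: f(z) = z."""
    return w * x + b


def two_layers(x: float) -> float:
    """Two stacked linear layers (illustrative weights and biases)."""
    hidden = linear_layer(x, w=2.0, b=1.0)
    return linear_layer(hidden, w=-3.0, b=0.5)


def collapsed_layer(x: float) -> float:
    """One layer that reproduces both: w = w2 * w1 = -6.0, b = w2 * b1 + b2 = -2.5."""
    return linear_layer(x, w=-6.0, b=-2.5)
```

For any input, `two_layers(x)` and `collapsed_layer(x)` return the same value, so the second layer adds no expressive power.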
A network built from linear activation functions is just a linear regression model. Such a model cannot represent intricate mappings between the network’s inputs and outputs. We can circumvent these limitations by employing non-linear activation functions. Since their derivatives depend on the input, they make backpropagation possible. To be more exact, they make it feasible to determine which weights of the input neurons lead to better predictions. Moreover, non-linear activation functions make it meaningful to stack multiple layers of neurons, because the output then becomes a non-linear combination of the inputs passed through each stacked layer.
1.3. Sigmoid
This function accepts any real value as its input and will always return a value between 0 and 1. As demonstrated below, the output value will be closer to 1 if the input value is greater and closer to 0 if the input value is smaller. We can express this function mathematically as:
This function is frequently employed in models that must output a probability. Since a probability always lies between 0 and 1, the sigmoid function is a natural choice. The function is differentiable and has a smooth, S-shaped curve with a continuous gradient, which prevents jumps in the output values. The sigmoid function also has certain limitations, which follow from its derivative. The derivative of the sigmoid function is expressed mathematically as:
The gradient values of the sigmoid are meaningful only in the region from -3 to 3; outside this region the graph becomes much flatter. Hence the gradients of the function are extremely small for inputs greater than 3 or less than -3. When the gradient gets near zero, the network effectively stops learning, which is known as the vanishing gradient problem. In addition, the output of the logistic function is not symmetric around zero, so the output of every neuron has the same sign. Consequently, training the neural network becomes more difficult and more likely to be unstable.
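The flattening of the gradient can be illustrated with a short sketch of the standard logistic function and its derivative. The function names are chosen here for illustration, and the code is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

The gradient peaks at 0.25 for x = 0 and falls below 0.05 once the magnitude of x exceeds 3. The output itself is always positive, which is the lack of symmetry around zero mentioned above.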
1.4. Hyperbolic Tangent
The TANH function closely resembles the sigmoid (logistic) activation function. It has the same S-shape, but its output ranges from -1 to 1. With TANH, the output value is closer to 1.0 when the input value is larger and closer to -1.0 when the input value is smaller. Equation 5 represents the TANH function:
The output of the TANH activation function is zero-centered, so it is simple to interpret output values as strongly negative, neutral, or strongly positive. Because its values range from -1 to 1, TANH is frequently used in the hidden layers of a neural network, where it keeps the mean of the hidden layer’s output at 0 or very close to it. This property helps center the data and simplifies learning the parameters of the next layer.
Like the sigmoid, TANH suffers from vanishing gradients: its gradient tends to zero as the magnitude of the input grows. Its gradient is, however, noticeably steeper than that of the sigmoid function. Moreover, because TANH is zero-centered, its outputs can take either sign, so the gradients are not restricted to a single direction as they are with sigmoid. Therefore, in practical applications, TANH non-linearity is generally preferred over sigmoid non-linearity.
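A brief sketch can make the comparison between the two gradients concrete. It uses the standard derivative of the hyperbolic tangent, 1 - tanh(x)^2; the function names are chosen here for illustration.

```python
import math


def tanh(x: float) -> float:
    """Hyperbolic tangent; output lies in (-1, 1) and is zero-centered."""
    return math.tanh(x)


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic function, included only for comparison."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)
```

At x = 0 the TANH gradient is 1, four times the sigmoid's peak of 0.25, yet both gradients shrink toward zero as the magnitude of x grows.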
1.5. RELU
RELU is a computationally efficient function with a derivative that is well suited to backpropagation. The RELU function does not activate all neurons at the same time: a neuron is deactivated whenever the output of its linear transformation is less than 0. Equation 6 represents the RELU function:
Because RELU activates only a subset of the neurons, whereas the sigmoid and TANH functions activate a much larger number, RELU is significantly more computationally efficient. In addition, because RELU is linear for positive inputs and does not saturate there, it speeds up the convergence of gradient descent toward a minimum of the loss function. However, there are some restrictions associated with using this function. The gradient value is zero on the negative side of the graph. For this reason, backpropagation does not update the weights of some neurons, which can cause those neurons to become permanently inactive, a problem known as dying neurons. Moreover, because all negative input values are immediately mapped to zero, the model’s capacity to fit the data or learn from it diminishes.
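The dying-neuron effect can be sketched with a single-weight neuron. The helper `sgd_step`, the learning rate, and the input values are hypothetical and serve only to illustrate the zero gradient on the negative side.

```python
def relu(x: float) -> float:
    """RELU: max(x, 0)."""
    return max(x, 0.0)


def relu_grad(x: float) -> float:
    """Gradient of RELU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


def sgd_step(w: float, x: float, upstream_grad: float, lr: float = 0.1) -> float:
    """One gradient-descent step for a hypothetical neuron y = relu(w * x)."""
    grad_w = upstream_grad * relu_grad(w * x) * x
    return w - lr * grad_w
```

With `w = -1.0` and `x = 2.0` the pre-activation is negative, the gradient is zero, and `sgd_step` returns the weight unchanged; with `w = 1.0` the same step does update the weight.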
1.6. Leaky RELU
Leaky RELU is a variant of the RELU function developed to solve the dying-neuron problem. It does so by giving the function a small positive slope in the negative region of its domain. Equation 7 expresses Leaky RELU:
Leaky RELU retains the benefits of RELU and additionally enables backpropagation for negative inputs. With this simple adjustment, the gradient on the left side of the graph is a small positive value rather than zero, so dead neurons no longer occur in that region. The function has some limitations, however. Predictions may be inconsistent for negative input values, and because the gradient for negative values is so small, learning the model parameters takes much more time.
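A minimal sketch of Leaky RELU and its gradient follows. The slope of 0.01 for negative inputs is a commonly used value and an assumption here; the text does not state which slope is used.

```python
def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Leaky RELU: x for positive inputs, slope * x otherwise (slope is assumed)."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x: float, slope: float = 0.01) -> float:
    """Gradient of Leaky RELU: 1 for positive inputs, the small slope otherwise."""
    return 1.0 if x > 0 else slope
```

The gradient for negative inputs is small but never zero, so the weights of such neurons keep receiving updates, only slowly.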
1.7. Parametric RELU
Parametric RELU is a form of RELU designed to address the zero gradient on the left side of the axis. Equation 8 represents Parametric RELU:
The parameter in this equation determines the slope for negative values, and the effectiveness of this function may vary from problem to problem depending on the value of this slope parameter.
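The role of the slope parameter can be sketched as follows. The symbol `a`, the function names, and the value 0.25 used in the usage note are illustrative assumptions, not values taken from the text.

```python
def prelu(x: float, a: float) -> float:
    """Parametric RELU: x for positive inputs, a * x otherwise (a is the slope parameter)."""
    return x if x > 0 else a * x


def prelu_grad_input(x: float, a: float) -> float:
    """Gradient with respect to the input: 1 for positive inputs, a otherwise."""
    return 1.0 if x > 0 else a


def prelu_grad_slope(x: float) -> float:
    """Gradient with respect to the slope parameter a: x for non-positive inputs, 0 otherwise."""
    return 0.0 if x > 0 else x
```

For example, `prelu(-2.0, a=0.25)` returns -0.5. Because `prelu_grad_slope` is non-zero for negative inputs, the slope can be adjusted by gradient descent together with the weights rather than being fixed in advance.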
1.8. Exponential Linear Units
The Exponential Linear Unit (ELU) is an alternative to RELU that changes the slope of the function’s negative portion. In contrast to Leaky RELU and Parametric RELU, which both use a straight line for negative values, ELU uses an exponential curve. Equation 9 represents the ELU function:
ELU saturates smoothly for large negative inputs, whereas RELU changes abruptly at zero. The exponential curve for negative input values eliminates the dying RELU problem and helps the network adjust the weights and biases in the appropriate direction. However, the exponential operation increases the computation time. Moreover, the function’s hyperparameter α is not learned. The exploding gradient problem is another main concern for ELU.
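A minimal sketch of the standard ELU definition, with negative branch alpha * (exp(x) - 1), is given below. The default `alpha = 1.0` is a common choice and an assumption here.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: x for positive inputs, alpha * (exp(x) - 1) otherwise (alpha is assumed)."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Gradient of ELU: 1 for positive inputs, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)
```

For large negative inputs the output levels off near `-alpha` while the gradient stays positive, so the neuron can still recover, at the cost of evaluating an exponential.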
1.9. SELU
In self-normalizing networks, SELU takes care of internal normalization: each layer preserves the mean and variance of the layers that precede it. SELU achieves this normalization by adjusting both the mean and the variance. Because it outputs both positive and negative values, SELU can shift the mean, whereas the RELU activation function cannot output negative values and therefore cannot shift the mean. Equation 10 represents the SELU function:
SELU has predefined alpha and lambda values. The primary advantage of SELU over RELU is that internal normalization is faster than external normalization, so the network can converge faster. SELU is a relatively new activation function, and its behavior on architectures such as CNNs and RNNs needs additional research.
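The self-normalizing behavior can be probed with a small experiment. The sketch below uses the commonly published SELU constants (alpha ≈ 1.6733, lambda ≈ 1.0507); the helper `selu_moments`, the sample size, and the random seed are assumptions made for illustration.

```python
import math
import random
from typing import Tuple

ALPHA = 1.6732632423543772   # predefined alpha (commonly published value)
SCALE = 1.0507009873554805   # predefined lambda (commonly published value)


def selu(x: float) -> float:
    """SELU: lambda * x for positive inputs, lambda * alpha * (exp(x) - 1) otherwise."""
    return SCALE * (x if x > 0 else ALPHA * math.expm1(x))


def selu_moments(n: int = 50_000, seed: int = 0) -> Tuple[float, float]:
    """Apply SELU to n standard-normal samples and return (mean, variance) of the outputs."""
    rng = random.Random(seed)
    outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(n)]
    mean = sum(outputs) / n
    variance = sum((v - mean) ** 2 for v in outputs) / n
    return mean, variance
```

For standard-normal inputs, the mean and variance returned by `selu_moments()` should stay close to 0 and 1, which is the property that lets each layer preserve the statistics of the previous one.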
1.10. SOFTMAX
This activation function builds on the sigmoid (logistic) activation function and extends it to multiple classes. Equation 11 represents the SOFTMAX function:
The SOFTMAX function can be viewed as a combination of several sigmoid functions. It computes relative probabilities: in a manner analogous to the sigmoid (logistic) activation function, it returns the probability associated with each class. In multiclass classification, practitioners widely employ it as the activation function for the final layer of a neural network.
For instance, if there are three different classes, the output layer has three neurons. Suppose the output of these neurons is [1.8, 0.9, 0.68]. Applying the SOFTMAX function to these values gives the probabilistic view [0.58, 0.23, 0.19]. Selecting the index with the highest probability then yields 1 for that index and 0 for the other two. In this case, index 0 receives the full weight, and indexes 1 and 2 receive none. The predicted class is therefore the one that corresponds to the first neuron (index 0) out of the three.
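The worked example above can be reproduced with a short sketch. The helper names are chosen here for illustration, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw neuron outputs into probabilities that sum to 1."""
    shift = max(logits)  # stability shift; leaves the probabilities unchanged
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


probs = softmax([1.8, 0.9, 0.68])
rounded = [round(p, 2) for p in probs]   # [0.58, 0.23, 0.19]
predicted_class = argmax(probs)          # 0
```

Note that the one-hot selection of the winning class is a separate argmax step applied after SOFTMAX, not part of the SOFTMAX function itself.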
1.11. SWISH
Google researchers developed this self-gated activation function. The SWISH activation function consistently matches or outperforms the RELU activation function on deep neural networks applied to challenging domains. The function is bounded below but unbounded above: Y approaches a constant value as X approaches negative infinity, whereas Y approaches infinity as X approaches infinity. Equation 12 represents this function:
SWISH has several advantages over RELU. First, SWISH is a smooth function: it does not abruptly change direction at x = 0 as RELU does. Instead, SWISH follows a smooth curve that descends from 0 to values slightly less than 0 before rising again. Second, the RELU activation function nullifies all negative values, even though these negative values could help capture the data’s underlying patterns.
The non-monotonic nature of the SWISH activation function allows a more expressive representation of the input data and of the weights to be learned during training.
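The smooth, non-monotonic shape can be checked with a minimal sketch. It assumes the common form x * sigmoid(x) without an additional scaling parameter, which may differ from the exact form used in Equation 12.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float) -> float:
    """SWISH in its simplest form: x * sigmoid(x) (assumed form, no extra parameter)."""
    return x * sigmoid(x)
```

The value at x = -1 is lower than the values at both x = -5 and x = 0, which shows the dip below zero. For large positive inputs the function approaches x, and for large negative inputs it approaches 0.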
At this point, the most widely used activation function is RELU [3, 4, 5, 6], defined as f(x) = max(x,0). RELU outperforms many other activation functions, such as sigmoid and TANH, because it overcomes some of their problems, such as vanishing gradients. The use of RELU was a breakthrough that enabled the fully supervised training of state-of-the-art deep networks [7]. Deep neural networks built around RELU are easier to optimize than networks with sigmoid or TANH units [8]. Researchers have proposed numerous activation functions to replace RELU [9, 10, 11, 12], but none of them has been as successful as SWISH [8]. With the introduction of SWISH, many practitioners started to favor it over RELU because of its simplicity, reliability, and consistent performance improvement across different models and datasets.
Figure 1 shows the abovementioned activation functions.
In this paper, we propose a new activation function, named AIF, that surpasses Google Brain’s SWISH function. Our extensive experiments on various datasets and architectures indicate that replacing SWISH units with AIF units accelerates training and improves classification accuracy.