SigmoReLU: An improved activation function combining Sigmoid and ReLU

Two of the most common activation functions (AF) in deep neural network (DNN) training are Sigmoid and ReLU. Sigmoid tended to be more popular in previous decades, but it suffers from the common vanishing gradient problem. ReLU resolved this problem by using a gradient of exactly zero, rather than tiny values, for negative weights, and a gradient of 1 for all positives. Although it significantly mitigates the vanishing of the gradients, it poses a new issue of dying neurons caused by the zero gradients. Recent approaches for improvement follow a similar direction, proposing mere variations of the AF, such as Leaky ReLU (LReLU), while remaining within the same unresolved gradient problems. In this paper, the combination of Sigmoid and ReLU in one single function is proposed, as a way to take advantage of both. The experimental results demonstrate that using ReLU's gradient solution on positive weights and Sigmoid's gradient solution on negatives yields a significant improvement in the training performance of neural networks on image classification of diseases such as COVID-19, as well as on text and tabular data classification tasks, across five different datasets.
MSC Subject Classification: 68T07, 68T45, 68T10, 68T50, 68U35


INTRODUCTION
In previous decades, neural networks usually employed logistic sigmoid activation functions.
Unfortunately, this type of AF is affected by saturation issues such as the vanishing gradient. To overcome this weakness and improve accuracy, an active area of research is trying to design novel activation functions (Franco Manessi et al., 2019), with ReLU appearing to be the most well-established in recent years. However, ReLU suffers from the 'dying ReLU' problem, which also impacts training. Many variations of the AF, such as LReLU, have been proposed to solve this issue, while remaining within the same unresolved gradient problems. Despite recent developments of AFs for shallow and deep learning neural networks (NN), such as QReLU/m-QReLU (Parisi et al., 2020a), m-arcsinh (Parisi et al., 2020b) and ALReLU (Mastromichalakis, 2020), the repeatable and reproducible functions have remained very limited and confined to three activation functions regarded as the 'gold standard' (Parisi et al., 2020b). Sigmoid and tanh are well known for their common vanishing gradient issues, and only the ReLU function seems to be more accurate and scalable for DNNs, despite the 'dying ReLU' problem.
In this work, a new AF, SigmoReLU, is proposed, combining the advantages of two different state-of-the-art AFs, ReLU and Sigmoid. The new AF uses both the gradient solution of ReLU and that of Sigmoid, depending on whether the input weights are positive or negative. This 'solution blending', conditioned on the sign of the input, aims to solve the common vanishing gradient and 'dying ReLU' problems simultaneously, and it shows a significant positive impact on the training and classification accuracy of DNNs, as concluded from the results of the numerical evaluation performed. The combination of AFs to increase NN performance appears to have remained largely unexplored in the literature, with the exception of some recent works (Renlong Jie et al., 2020; Franco Manessi et al., 2019). The outline of this paper is as follows: Section 2 contains one of the main contributions of this work, the implementation of SigmoReLU in Keras. Section 3 presents experimental results of the proposed AF, including an evaluation of training accuracy and a comparison with other well-established AFs in the field. Finally, the discussion and main conclusions of the work are given in Section 4.

Datasets and NN model hyperparameters
The data sets used for the image, text and tabular data classification tasks are those that were also used in the evaluation of ALReLU (Mastromichalakis, 2020).

The SigmoReLU AF
The Rectified Linear Unit, or ReLU, is commonly used between layers to add nonlinearity in order to handle more complex and nonlinear datasets. Fig. 1 demonstrates the ReLU, which can be expressed as in Eq. (1); its derivative is given in Eq. (2):

    f(x) = max(0, x)                                    (1)

    f'(x) = 0 for x < 0,   f'(x) = 1 for x > 0          (2)

The issues of ReLU come from the fact that it is not differentiable at x = 0 and that it sets all values < 0 to zero. Although this can be beneficial on sparse data, when the gradient is 0, neurons arriving at large negative values cannot recover from being stuck at 0. A neuron at this stage effectively dies. This is known as the 'dying ReLU' problem, and it can lead the network to essentially stop learning and underperform.
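As a minimal NumPy sketch of Eq. (1) and Eq. (2) (the input values below are illustrative only):

    import numpy as np

    def relu(x):
        # Eq. (1): pass positive values through, zero out negatives
        return np.maximum(0.0, x)

    def relu_grad(x):
        # Eq. (2): gradient is 1 for x > 0 and 0 for x < 0 (undefined at x = 0)
        return (x > 0).astype(float)

    x = np.array([-2.0, -0.5, 0.5, 2.0])
    print(relu(x))       # [0.  0.  0.5 2. ]
    print(relu_grad(x))  # [0. 0. 1. 1.] -- zero gradient for every negative input

The zero gradient on the whole negative side is exactly the mechanism behind the 'dying ReLU' problem described above.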
Current improvements to ReLU, such as LReLU, allow for a more non-linear output, either to account for small negative values or to facilitate the transition from positive to small negative values, though without eliminating the problem.
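For reference, LReLU can be sketched as follows (assuming a typical small slope such as alpha = 0.01; the exact value varies between implementations):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # a small slope alpha keeps a non-zero gradient for x < 0,
        # so negative neurons are not completely silenced as in plain ReLU
        return np.where(x > 0, x, alpha * x)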
Sigmoid, on the other hand, suffers from the vanishing gradient problem. If the Sigmoid input (in absolute value) is too large, the gradient of the sigmoid function becomes too small. The disadvantage is that with many layers (i.e. a DNN) these gradients are multiplied together, and the product of many values smaller than 1 goes to zero very quickly. For this reason, Sigmoid and its relatives, such as tanh, are not suitable for training DNNs. This problem has been solved by ReLU, whose gradient is either 0 for x < 0 or 1 for x > 0, meaning that one can stack as many layers as desired, because multiplying the gradients will neither vanish nor explode. This is the reason ReLU has been more commonly used in DNNs in recent years. The Sigmoid and its derivative are given in Eq. (3) and Eq. (4) and demonstrated in Fig. 2:

    f(x) = 1 / (1 + e^(-x))                                      (3)

    f'(x) = e^(-x) / (1 + e^(-x))^2 = f(x) (1 - f(x))            (4)

Although ReLU is generally more robust and useful than Sigmoid, the latter also has some advantages in particular situations. In fact, there are cases where Sigmoid can perform better than ReLU.
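A short numerical illustration of the vanishing gradient of Eq. (4) (the sample inputs are illustrative only):

    import numpy as np

    def sigmoid(x):
        # Eq. (3)
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        # Eq. (4): f'(x) = f(x) * (1 - f(x)), with maximum value 0.25 at x = 0
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in [0.0, 2.0, 5.0, 10.0]:
        print(x, sigmoid_grad(x))  # 0.25, ~0.105, ~0.0066, ~0.000045

Multiplying such gradients across many layers drives the product toward zero, which is the vanishing gradient effect described above.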
Consequently, this study investigates the development of a new AF that takes the best advantages of both of these functions, in order to obviate the 'dying ReLU' problem and the vanishing gradient of Sigmoid. This is achieved by using ReLU's gradient solution where Sigmoid(x) < x (i.e. for sufficiently large positive x) and Sigmoid's gradient solution where Sigmoid(x) ≥ x, which amounts to taking the element-wise maximum of the two functions:

    f(x) = max(ReLU(x), Sigmoid(x))                                                       (5)

    f'(x) = 1 if Sigmoid(x) < x,   f'(x) = Sigmoid(x) (1 - Sigmoid(x)) if Sigmoid(x) > x  (6)

Although it is obvious that this new function is not differentiable everywhere, as demonstrated in Fig. 3 and Eq. (6), this does not seem to cause serious problems or to harm training performance. On the contrary, the experiments and results in Section 3 indicate a significant positive impact on performance from this combination of functions.
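A minimal Keras implementation consistent with Eq. (5) can be sketched as follows (the function name and the layer usage are illustrative, not necessarily the original code):

    from tensorflow import keras
    from tensorflow.keras import backend as K

    def sigmo_relu(x):
        # Eq. (5): element-wise maximum of ReLU(x) and Sigmoid(x).
        # Where Sigmoid(x) < x the ReLU branch (gradient 1) is active;
        # elsewhere the Sigmoid branch is active, so negative inputs
        # keep a small non-zero gradient instead of dying.
        return K.maximum(K.relu(x), K.sigmoid(x))

    # illustrative usage: pass the function as a layer activation
    layer = keras.layers.Dense(64, activation=sigmo_relu)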

EXPERIMENTAL RESULTS
In order to estimate training performance and classification accuracy, 5-fold validation was used on every dataset. The 5-fold validation procedure was executed 10 times for every model and dataset to handle the uncertainty caused by the GPU and TensorFlow (a sketch of this protocol is given at the end of this section). The average results are presented in this section, and they clearly support the theoretical superiority of the proposed SigmoReLU AF when compared to the well-established ReLU and LReLU AFs. The classification performance results are presented in Table 1. A main finding and contribution of this work is that the combination of different AFs can also combine their individual advantages and achieve more accurate and robust results, by using the two different AF solutions depending on the sign of the input. In this way, both the vanishing gradient and 'dying ReLU' problems are addressed at the same time. It is also important that the proposed combination achieves very high accuracy in COVID-19 image classification. In future work, different combinations of AFs may be proposed and tested, such as ReLU and tanh.
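For reproducibility, the repeated 5-fold protocol described above can be sketched as follows (build_model, the epoch count and the accuracy metric are placeholders and assumptions, not the paper's exact settings):

    import numpy as np
    from sklearn.model_selection import KFold

    def repeated_kfold_accuracy(build_model, X, y, n_splits=5, n_repeats=10):
        # build_model is assumed to return a freshly compiled Keras model
        # whose compile() call includes metrics=['accuracy']
        scores = []
        for repeat in range(n_repeats):
            folds = KFold(n_splits=n_splits, shuffle=True, random_state=repeat)
            for train_idx, val_idx in folds.split(X):
                model = build_model()
                model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
                _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
                scores.append(acc)
        return float(np.mean(scores))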