Preprint Article (this version is not peer-reviewed)

Explainability of Node Modulation in Classical and Physics-Informed Neural Networks via a Generalized Activation Function

Submitted: 28 April 2026. Posted: 29 April 2026.


Abstract
The recently proposed Cannistraci–Muscoloni–Gu Generalized Logistic–Logit Function (CMG-GLLF) introduced a flexible, trainable activation function capable of modulating input features in multilayer perceptrons (MLPs). However, its initial implicit approximation of the logit phase caused computational overhead and numerical instability during training, limiting its application to deeper learning tasks and more complex network architectures. In this study, we derive a fully explicit and differentiable expression for the CMG-GLLF using a one-step Newton's method approximation. We demonstrate that this new formulation resolves the prior numerical instabilities and reduces computational overhead to match vanilla networks. When applied as an activation function to input layers, it outperforms vanilla networks and linear modulators across MLPs, a simple CNN, and VGG-16 architectures on CIFAR-10/100, notably achieving superior performance on VGG-16 with a highly efficient "channel-wise" strategy that adds only 6 learnable parameters. Furthermore, we explore CMG-GLLF as a hidden-layer activation function. On image classification tasks, its parameters naturally converge to approximate a ReLU-like shape, explaining its comparable performance. In contrast, on physics-informed neural networks (PINNs) solving three different physics prediction tasks, CMG-GLLF learns activation functions markedly different from the standard Tanh activation and achieves significantly better performance. These results establish CMG-GLLF as a trainable activation function capable of explaining the performance gains associated with the nonlinear functions learned at each node. Overall, this new formulation of CMG-GLLF provides powerful, scalable, and highly explainable node modulation for both classical and physics-informed neural networks.

2. Introduction

Artificial neural networks rely fundamentally on nonlinear activation functions to learn complex representations. Traditionally, standard activations, such as the Rectified Linear Unit (ReLU) or the hyperbolic tangent (Tanh), apply a fixed mathematical transformation identically across nodes in a network layer.(Nair and Hinton, 2010) However, in complex biological systems, information processing is rarely rigid; neurons can dynamically modulate their responses to regulate information flow, as observed in sensory systems such as the retina.(Perlman and Normann, 1998; Sato and Kefalov, 2025) Inspired by this principle, an ideal trainable activation function should be able to modulate the output signal of each node by learning whether the node should become more sensitive or less sensitive to specific input ranges according to the task.
To address this need for flexible signal modulation, the Cannistraci–Muscoloni–Gu Generalized Logistic–Logit Function (CMG-GLLF) was recently introduced.(Gu et al., 2025) CMG-GLLF is a trainable activation function that allows explicit and independent control over steepness, asymmetry, and exact reachable boundaries on both the x- and y-axes. By learning these parameters during backpropagation, CMG-GLLF modulates the signal sensitivity of each node. The learned logistic curves can sharpen the input distribution by increasing local sensitivity in informative regions. Conversely, the learned logit curves can compress differences across selected input ranges, reducing the node’s sensitivity to variation.
Because this generalized function can be deployed at various stages of a neural network, it is important to establish clear terminology regarding its application. In this study, we use the term "Input Feature Modulator" (IFM) specifically to indicate the application of the CMG-GLLF to the input layer of a network. Conversely, when applied to the hidden layers of a network, it is referred to as a hidden-layer activation function. However, regardless of whether it is deployed as an IFM at the input layer or as an activation function in the hidden layers, the underlying mechanism remains identical: it is a trainable activation function designed to dynamically modulate the node input representation for better model performance.
Despite these significant theoretical advantages, the original implementation of the CMG-GLLF (Gu et al., 2025) has certain practical limitations. Because the generalized logit phase of the curve lacked an explicit analytical expression, the previous implementation relied on an implicit approximation algorithm to invert the logistic curve. This implicit approximation not only increased computational overhead, but also required the implicit function theorem to compute approximate gradients, which could lead to numerical instability during backpropagation. Consequently, its application was limited to simple structures such as multi-layer perceptrons (MLPs), preventing it from being tested on more complex architectures.
In this study, we overcome these bottlenecks by deriving a fully explicit and differentiable formulation of the CMG-GLLF using a one-step Newton's method approximation, which we refer to as CMG-Explicit.(Kelley, 2003) This formulation eliminates the need for implicit inversion and implicit-gradient approximation, thereby resolving the numerical-instability issues of the previous CMG-Implicit implementation while reducing GPU memory and training-time overhead to near-vanilla-network levels. Enabled by this scalable formulation, we systematically evaluate CMG-Explicit as both an input feature modulator and a hidden-layer activation function. Across MLPs, simple CNNs, and VGG-16 on CIFAR-10/100, CMG-Explicit used as an IFM improves performance over vanilla networks and linear modulators, with the channel-wise VGG-16 strategy achieving strong gains using only six additional learnable parameters.(Simonyan and Zisserman, 2015)
When using CMG-Explicit as a hidden-layer activation function, physics-informed neural networks (PINNs) provide a complementary benchmark beyond the data-driven image classification tasks: multilayer perceptrons are the most common neural architecture in PINN frameworks, and Tanh is widely used and often reported to provide strong and stable performance in PINN applications.(Raissi et al., 2019; Zhao et al., 2024; Fan and Chen, 2026) This makes PINNs a natural setting to evaluate whether CMG-GLLF, a generalized and trainable logistic-logit function, can learn task-adaptive transformations beyond standard choices.
We demonstrate that, when used as a hidden-layer activation function, CMG-Explicit provides explainable functional adaptation: in classical image-classification tasks, the CMG-Explicit shape converges toward ReLU-like behavior and thus achieves classification accuracy similar to ReLU, whereas in physics-informed neural networks it learns more diverse activation shapes that outperform the standard Tanh baseline across multiple PINN benchmarks. These results demonstrate the potential of CMG-Explicit for explainability. By analyzing the parameter distributions learned by the network, researchers can achieve a high degree of explainability, empirically observing what functional behavior the nodes adopt to solve a given task. Therefore, the value of the CMG-GLLF lies not only in improving predictive performance, but also in its ability to explicitly explain the performance gains of activation-function node modulation in neural networks.

3. Results

3.1. Derivation of the Explicit CMG-Newton Formulation

The previously proposed Cannistraci–Muscoloni–Gu Generalized Logistic–Logit Function (CMG-GLLF) provides a unified mathematical framework to generate both generalized logistic and logit curves, allowing independent control over a curve's steepness, asymmetry, and precise input/output boundaries.(Gu et al., 2025) While the logistic phase of the original formulation possessed an explicit analytical expression, the logit phase (the inverse of the generalized logistic function) lacked a closed-form solution. Consequently, the early implementation relied on an implicit approximation algorithm, in which the values of the logit-phase curve were derived by dividing the output range into discrete steps and finding the closest matching input point. Furthermore, because this implicit approach lacked an analytical derivative, gradient calculations during backpropagation relied on the implicit function theorem. This limitation introduced large computational overhead (increased GPU memory and training time) and risks of numerical instability (non-finite gradients and parameters) during backpropagation, strictly limiting its scalability to modern deep learning architectures.
To overcome this fundamental bottleneck, we derive a fully explicit, differentiable, and computationally efficient mathematical expression for the full CMG-GLLF framework. Newton's method is a classical root-finding technique for solving nonlinear equations: it starts from an initial estimate and improves it by locally linearizing a differentiable function and updating the estimate using its first derivative. This makes it suitable for approximating inverse functions, because inverting a nonlinear function can be reformulated as finding the root of an equation. In our case, the logit-phase CMG is obtained by approximating the inverse of the logistic-phase CMG. We selected Newton's method because, unlike bisection, it does not rely on interval-based conditional updates, and unlike the secant method, it does not require two arbitrary starting points to approximate the derivative.(Papakonstantinou and Tapia, 2013; Burden et al., 2016) Since the derivative of the inverse problem is analytically available, a one-step Newton approximation can be simplified into a compact, explicit, and differentiable expression suitable for neural-network training. By using a one-step Newton's method approximation to invert the logistic phase, we construct the new formulation, which we denote CMG-Explicit, while the old implementation is denoted CMG-Implicit.
Let $x \in [x_{\min}, x_{\max}]$ be the input and $y \in [y_L, y_R]$ the mapped output range. To simplify the formulation, we define the normalized input $t \in [0, 1]$ as:

$$ t = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$
The unified, explicit expression for the CMG-Explicit function with inflection rate µ ∈ [0, 1] and the deviate inflection point I ∈ [0, 1] is:
$$
\mathrm{CMG}(x) =
\begin{cases}
y_L + \dfrac{y_R - y_L}{1 + \frac{1-t}{t}\exp\!\left(-\left(\frac{1}{\mu} - 2\right)(t - I)\right)}, & 0 \le \mu \le 0.5\\[2ex]
y_L + (y_R - y_L)\,\dfrac{t + \left(\frac{1}{1-\mu} - 2\right) I\, t\, (1 - t)}{1 + \left(\frac{1}{1-\mu} - 2\right) t\, (1 - t)}, & 0.5 < \mu \le 1
\end{cases}
$$
As visualized in Figure 1, the CMG-Explicit derivation perfectly preserves the flexible steepness and asymmetry control of the original implicit framework. The fully explicit nature of this expression guarantees that gradients can be computed instantly via standard automatic differentiation, paving the way for stable and efficient scaling to deep convolutional and physics-informed architectures. The derivation can be found in Supplementary Note 1.
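To make the formulation concrete, the following is a minimal PyTorch sketch of the explicit expression above; the function name cmg_explicit, the default boundaries, and the eps clamp are our illustrative choices, not the authors' released implementation.

```python
import torch

def cmg_explicit(x, mu, I, x_min=-1.0, x_max=1.0, y_l=-1.0, y_r=1.0, eps=1e-6):
    """One-step-Newton explicit CMG-GLLF (a sketch; mu and I are tensors in [0, 1])."""
    # Normalize the input to t in (0, 1); the clamp keeps t away from the exact
    # boundaries, where (1 - t) / t would divide by zero.
    t = ((x - x_min) / (x_max - x_min)).clamp(eps, 1.0 - eps)
    mu = mu.clamp(eps, 1.0 - eps)
    # Logistic phase (0 <= mu <= 0.5): steeper as mu approaches 0.
    k = 1.0 / mu - 2.0
    logistic = y_l + (y_r - y_l) / (1.0 + (1.0 - t) / t * torch.exp(-k * (t - I)))
    # Logit phase (0.5 < mu <= 1): explicit one-step Newton inverse of the
    # logistic phase; flattens toward the inflection value as mu approaches 1.
    c = 1.0 / (1.0 - mu) - 2.0
    logit = y_l + (y_r - y_l) * (t + c * I * t * (1.0 - t)) / (1.0 + c * t * (1.0 - t))
    # Both branches reduce to the linear map y_l + (y_r - y_l) * t at mu = 0.5.
    return torch.where(mu <= 0.5, logistic, logit)
```

Both phases are composed only of elementary differentiable operations, so standard automatic differentiation yields exact gradients with respect to x, µ, and I.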

3.2. CMG as Input Feature Modulator on MLP and CNN

To verify that CMG-Explicit reduces computational overhead, we first applied it as an input feature modulator (IFM) on a multi-layer perceptron (MLP), where CMG acts as a trainable activation function applied to each input feature. In Table 1, we compare four settings: Vanilla MLP, Linear IFM, CMG-Implicit, and CMG-Explicit. On both CIFAR10 and CIFAR100, CMG-Explicit significantly reduces GPU memory and training time compared to the old CMG-Implicit, matching those of the Vanilla MLP. The numerical-stability results in Table 2 show that, like CMG-Implicit, CMG-Explicit produces no unstable training events. Figure 2 compares the test accuracy, demonstrating that CMG-Explicit also outperforms all other methods and proving the feasibility and advantage of using CMG-Explicit as an input feature modulator. Taken together, these results show not only that CMG-Explicit reduces computational overhead to match the Vanilla MLP, but also that it achieves better accuracy than the Vanilla MLP and Linear IFM counterparts.
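For illustration, an element-wise IFM can be wrapped as a small module holding one trainable (µ, I) pair per input feature. The class below is a hypothetical sketch building on the cmg_explicit function above, with both parameters initialized at 0.5 so training starts from the near-linear regime.

```python
import torch
from torch import nn

class CMGElementwiseIFM(nn.Module):
    """Element-wise input feature modulator: one trainable (mu, I) per feature."""
    def __init__(self, num_features):
        super().__init__()
        # mu = 0.5 is the linear phase of CMG, so the modulator starts neutral.
        self.mu = nn.Parameter(torch.full((num_features,), 0.5))
        self.I = nn.Parameter(torch.full((num_features,), 0.5))

    def forward(self, x):
        # x: (batch, num_features), assumed pre-scaled to the CMG input range.
        return cmg_explicit(x, self.mu.clamp(0.0, 1.0), self.I.clamp(0.0, 1.0))
```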
After showing that, used as an IFM, CMG-Explicit resolves the computational overhead and even improves performance in MLPs, we move on to test it as an IFM on more complex architectures, where CMG-Implicit failed to improve performance with respect to the vanilla network and linear IFM. Results in Table 3, on a CNN structure tested on the CIFAR10 and CIFAR100 tasks, show that CMG-Explicit as IFM resolves the numerical instability that CMG-Implicit exhibits during training on this more complex structure.
In Figure 3, we see that after solving this numerical-instability issue, CMG-Explicit not only significantly outperforms CMG-Implicit, but also achieves better accuracy than the Vanilla CNN and Linear IFM. The computational overhead is also verified in Table 4. As on the MLP, CMG-Explicit on the CNN significantly reduces GPU memory and training time compared to the old CMG-Implicit, matching those of the Vanilla CNN. Taken together, these results show that CMG-Explicit not only resolves the numerical instability of the old CMG implementation, but also achieves better performance than the Vanilla CNN and Linear IFM counterparts. In the following experiments we therefore use CMG-Explicit as the default and, unless explicitly mentioned, CMG refers to CMG-Explicit.
Furthermore, to go beyond this proof-of-concept on a simple CNN, we tested the more complex and popular VGG-16 architecture.(Simonyan and Zisserman, 2015) In addition to assigning a unique CMG modulator to each feature of the input image (the previous IFM strategy, which we call "element-wise"), we also propose assigning a unique CMG modulator to each RGB channel, a strategy we call "channel-wise". The channel-wise strategy drastically reduces the number of learned CMG parameters in the IFM model: it adds only 6 parameters to the entire network (µ and I for each of the 3 RGB channels, 2 × 3 = 6). The results in Figure 4 show that, on CIFAR10 and CIFAR100, both CMG channel-wise and CMG element-wise outperform the vanilla VGG-16, and, notably, channel-wise even outperforms element-wise with far fewer parameters.
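Under the same assumptions as the sketches above, the channel-wise variant only needs a broadcast (µ, I) pair per RGB channel, which is where the six-parameter count comes from; again a hypothetical sketch rather than the exact released code.

```python
import torch
from torch import nn

class CMGChannelwiseIFM(nn.Module):
    """Channel-wise IFM for RGB inputs: 2 parameters x 3 channels = 6 in total."""
    def __init__(self, num_channels=3):
        super().__init__()
        self.mu = nn.Parameter(torch.full((num_channels,), 0.5))
        self.I = nn.Parameter(torch.full((num_channels,), 0.5))

    def forward(self, x):
        # x: (batch, channels, height, width); reshape the parameters so that
        # one (mu, I) pair broadcasts over all pixels of its channel.
        mu = self.mu.clamp(0.0, 1.0).view(1, -1, 1, 1)
        I = self.I.clamp(0.0, 1.0).view(1, -1, 1, 1)
        return cmg_explicit(x, mu, I)
```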
Figure 4. Performance of CMG-Explicit IFM on VGG-16. The panels present the test accuracy of various input feature modulation strategies on the deep VGG-16 architecture for CIFAR10 and CIFAR100 datasets.

3.3. Explainability of CMG as an Activation Function

Having established the efficiency of CMG as an IFM, we shift our focus to evaluating it as a core hidden-layer activation function. In standard MLPs applied to image classification, the Rectified Linear Unit (ReLU) is overwhelmingly the most established and highly optimized activation function.(Nair and Hinton, 2010) To rigorously evaluate CMG in this role, we compared its performance against ReLU on CIFAR-10 and CIFAR-100 across a matched ablation of normalization techniques (None, Tanh-wrapper, LayerNorm, and BatchNorm).(Ioffe and Szegedy, 2015; Ba et al., 2016) Here, Tanh-wrapper means the pre-activation values are passed through Tanh to map them to the range [-1, 1], so that the input and output ranges of CMG can be fixed to [-1, 1] accordingly, and None means no normalization is applied. Specifically, we assign each hidden neuron a trainable CMG modulator, where the CMG is rectified to imitate the rectification property of ReLU. Table 5 shows that in two settings, None on CIFAR10 and Tanh-wrapper on CIFAR100, CMG yields a marginal accuracy increase over ReLU, while in all other cases it achieves performance comparable to, but not exceeding, ReLU.
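A hidden-layer version under the Tanh-wrapper setting could look as follows. Note that we assume here that "rectified" means clamping negative CMG outputs at zero in analogy with ReLU; this is our reading for the sketch, not a detail specified in this section.

```python
import torch
from torch import nn

class CMGHiddenActivation(nn.Module):
    """Neuron-wise rectified CMG behind a Tanh-wrapper (illustrative sketch)."""
    def __init__(self, num_neurons):
        super().__init__()
        self.mu = nn.Parameter(torch.full((num_neurons,), 0.5))
        self.I = nn.Parameter(torch.full((num_neurons,), 0.5))

    def forward(self, z):
        # Tanh-wrapper: fixes the CMG input range to [-1, 1] regardless of the
        # scale of the pre-activations z.
        t = torch.tanh(z)
        y = cmg_explicit(t, self.mu.clamp(0.0, 1.0), self.I.clamp(0.0, 1.0),
                         x_min=-1.0, x_max=1.0, y_l=-1.0, y_r=1.0)
        # Assumed rectification step, imitating ReLU's one-sided saturation.
        return torch.relu(y)
```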
To investigate why CMG only matches ReLU, we analyzed the CMG parameter distribution when no normalization is used. Figure 5 shows the CMG parameter distributions on CIFAR10 and CIFAR100: the µ parameters are centered around µ = 0.5, a value at which the rectified CMG approximates ReLU. Given that ReLU is the most popular and simplest activation function in ANNs, this shows that CMG can adapt itself to learn the 'best' activation shape, and it also explains why CMG achieves performance similar to ReLU in these MLP tasks. An ablation on the effect of placing CMG in different parts of the network for CIFAR10/100 is reported in Supplementary Note 2.
Beyond these conventional sanity checks of CMG-Explicit against ReLU, we find that the CMG activation function has the potential to work in physics-informed neural networks (PINNs), where neural networks are trained to imitate complex physical phenomena described by partial differential equations (PDEs).(Raissi et al., 2019) This setting is particularly suitable for two reasons. First, PINNs are most commonly implemented using multilayer perceptrons as neural function approximators, providing a direct extension of the architectures considered in our previous experiments.(Fan and Chen, 2026) Second, Tanh, a sigmoid-type activation function, is widely adopted as the default choice in PINNs and is often reported to provide strong and stable performance across benchmark problems.(Zhao et al., 2024) Since CMG is a trainable generalized logistic-logit function, this setting allows us to assess whether CMG can improve performance over the standard Tanh baseline.
The main evaluation metric in the PINN field is the relative L2 error (lower is better), which measures the prediction error against held-out data. We first tested on Burgers' equation, comparing different normalization methods for Tanh and CMG, since Tanh is the most widely applied activation function in PINNs.(Cuomo et al., 2022) The results are shown in Table 6: with the Tanh-wrapper as normalization method, CMG outperforms the best Tanh configuration by a wide margin (a 69.77% decrease in relative L2 error), a significant gain in performance.
Similarly, we test CMG as the activation function in PINNs on further tasks, including Allen-Cahn and diffusion-reaction.(Raissi et al., 2019) Besides assigning each neuron its own CMG activation function (neuron-wise), we also test assigning one CMG activation function per layer to reduce the number of parameters (layer-wise). The results are shown in Table 7. Although CMG neuron-wise performs best on Burgers and significantly decreases the error compared to Tanh, CMG layer-wise consistently outperforms Tanh on all three tasks.
To investigate why CMG outperforms Tanh by such a margin, we depict the distribution of the parameters learned by CMG neuron-wise in the Burgers' equation experiment in Figure 6. Unlike the CIFAR classification tasks, where the CMG parameters cluster tightly around µ = 0.5 to mimic the standard ReLU shape, in the PINN the distribution of the inflection rate (µ) is widely spread out, indicating that the network explicitly avoids collapsing to a single standard shape. Instead, CMG learns layer-specific shapes that deviate significantly from the standard Tanh benchmark. Thus, in domains where the standard activation (like Tanh) is sub-optimal, CMG not only achieves superior predictive accuracy but also explains this gain through its functional adaptation during learning. An ablation on the effect of placing CMG in different parts of the network for PINNs can be found in Supplementary Note 3.
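The explainability analysis itself is straightforward to reproduce: collect the learned µ values from every CMG module and histogram them. The helper below is a hypothetical snippet assuming the modules expose their parameters under the attribute name mu, as in the sketches above.

```python
import torch
import matplotlib.pyplot as plt

def plot_cmg_mu_distribution(model, bins=50):
    """Histogram the learned inflection rates mu across all CMG modules."""
    mus = [p.detach().flatten() for name, p in model.named_parameters()
           if name.split(".")[-1] == "mu"]
    values = torch.cat(mus).cpu().numpy()
    plt.hist(values, bins=bins)
    plt.xlabel("inflection rate mu")
    plt.ylabel("count")
    plt.show()
```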

4. Discussion

The Cannistraci–Muscoloni–Gu Generalized Logistic–Logit Function (CMG-GLLF) offers a highly flexible framework for node modulation, but its original implicit logit-phase approximation suffered from severe computational overhead and numerical instability when applied beyond MLP architectures. In this study, the derivation of the explicit CMG formulation successfully bridges the gap between theoretical generalized functions and practical, scalable deep learning. By resolving the numerical-stability issues and reducing computational overhead to near-vanilla-network levels, CMG can seamlessly scale to more complex architectures, progressing from standard MLPs to simple CNNs and deep VGG-16 networks. Beyond input feature modulation, CMG further acts as a hidden-layer activation function for physics-informed neural networks, generalizing it across diverse application domains.
A key finding of our investigation is the extreme parameter efficiency of CMG when scaling to deep convolutional architectures. The “channel-wise” Input Feature Modulator (IFM) application on VGG-16 proves that CMG does not need to be heavily parameterized to be highly effective. Modulating just the three RGB color channels, which adds only 6 learnable parameters to the entire network, surprisingly beats both the vanilla VGG-16 model and the more parameterized element-wise application.
When applied as a hidden-layer activation function in physics-informed neural networks (PINNs), we uncovered an interesting finding regarding parameter sharing. We found that the highly compressed layer-wise CMG activation consistently outperformed the standard Tanh baseline across all three PINN tasks (Burgers, Allen-Cahn, and Diffusion-Reaction). In contrast, the highly parameterized neuron-wise application’s performance was inconsistent across the tasks.
Furthermore, our results highlight the balance between explainability and performance when deploying CMG as a hidden-layer activation function. In classical image classification tasks like CIFAR10/100, where a standard activation like ReLU is already highly optimal, the learned µ parameters in CMG distribute tightly around 0.5, naturally mimicking ReLU-like behavior and thus achieving performance similar to ReLU. The primary value of the generalized function in such cases is explainability: it shows mathematically what functional transformations the hidden nodes intend to perform. Conversely, in PINNs, CMG explicitly avoids collapsing to a standard curve and instead learns widely spread-out, layer-specific shapes, achieving markedly better performance than the Tanh baseline. This flexible adaptation drastically reduces the relative L2 error, proving that CMG can simultaneously provide superior predictive performance and an interpretable explanation of the node functional modulations required in different regions of the neural network.

5. Methods

5.1. CMG as IFM and Activation Function on CIFAR10/100

Dataset
To study the performance of CMG as an activation function at input layers, the CIFAR-10 and CIFAR-100 datasets are adopted as benchmark image classification datasets.(Krizhevsky, 2009) Both consist of 32×32 color images. CIFAR-10 contains 60,000 images evenly distributed across 10 classes, while CIFAR-100 contains the same number of images divided into 100 classes with 600 images per class. Each dataset is split into 50,000 training and 10,000 test images.
Network architecture
Three network architectures are tested. For the MLP backbone on CIFAR10, we adopt the MLP proposed by Wesselink et al. (2024), with two hidden layers of dimensions 1024 and 512.(Wesselink et al., 2024) Each hidden layer is followed by a dropout layer with rate 0.3 for regularization. For CIFAR100, which is a more challenging task, we double the hidden-layer dimensions to add representational power, while all other settings remain the same as for CIFAR10. To investigate the scalability of CMG to more complex structures, we first adopt a simple CNN consisting of three convolutional blocks with output channels [32, 64, 128], followed by adaptive global average pooling and a final linear classifier. We then use a standard CIFAR-style VGG-16 backbone as a deeper and more complex test structure. It consists of sequences of Conv-BatchNorm-ReLU blocks with the max-pooling schedule (M): [64, 64, M, 128, 128, M, 256, 256, 256, M, 512, 512, 512, M, 512, 512, 512, M], terminating in an adaptive average pooling layer (1×1) and a final linear layer mapping 512 features to the respective output dimension.
Learning rate scheduler
In our previous input feature modulator study, we designed a supra-linear learning rate scheduler and verified in a comparative study that it works better than linear and sub-linear schedulers; it is therefore adopted here for the IFM studies.(Gu et al., 2025) Denoting $LR_{\text{init}}$ as the initial learning rate, $T$ as the total number of epochs, and $t$ as the current epoch, the learning rate of the supra-linear scheduler is
$$
LR(t) =
\begin{cases}
LR_{\text{init}} \cdot \left(0.01 + 0.99 \cdot \left(1 - \dfrac{t}{0.9\,T}\right)\right), & 0 \le t \le 0.9\,T\\[1ex]
LR_{\text{init}} \cdot 0.01, & 0.9\,T < t \le T
\end{cases}
$$
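In code, the schedule decays to 1% of the initial rate over the first 90% of training and then stays flat; a minimal sketch of the formula above:

```python
def supra_linear_lr(t, total_epochs, lr_init):
    """Learning rate at epoch t under the scheduler defined above."""
    cutoff = 0.9 * total_epochs
    if t <= cutoff:
        return lr_init * (0.01 + 0.99 * (1.0 - t / cutoff))
    return lr_init * 0.01
```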
Optimizer
Muon is a recently proposed optimizer that has demonstrated strong empirical performance and improved convergence in training modern neural networks. In our previous input feature modulator study, we verified that it works much better than the well-established SGD and AdamW, so it is adopted here for the IFM studies.(Jordan, 2024; Gu et al., 2025)
Training setup
For all experiments, we perform a grid search over the hyperparameters shown in Table 8 and run 3 seeds; the evaluation metrics are averaged across the 3 seeds for more robust results.

5.2. CMG in Physics-Informed Neural Networks

PINN Tasks
To evaluate the effectiveness of CMG in physics-informed neural networks (PINNs), we consider three standard benchmark partial differential equations (PDEs): the Burgers’ equation, the Allen–Cahn equation, and a diffusion–reaction equation. (Raissi et al., 2019) These problems represent different classes of physical processes. The Burgers’ equation models viscous fluid flow and traffic-like transport phenomena. Its expression is
$$ u_t + u\,u_x = \nu\,u_{xx}, \qquad x \in \Omega,\; t > 0, $$
where u(x,t) is the solution and ν is the viscosity coefficient. It exhibits shock formation and steep gradients, making it challenging for learning sharp solution features.
The Allen–Cahn equation arises in phase-field models of phase separation, describing the evolution of interfaces between different states in materials. Its expression is
$$ u_t = \epsilon^2 u_{xx} + u - u^3, \qquad x \in \Omega,\; t > 0, $$
where $\epsilon$ controls the interface width.
The diffusion–reaction equation models the interplay between spatial diffusion and local reaction kinetics, and is commonly used to describe processes such as chemical reactions, population dynamics, and heat transfer with sources. Its representative form is
$$ u_t = D\,u_{xx} + \lambda\,u\,(1 - u), $$
where $D$ is the diffusion coefficient and $\lambda u (1 - u)$ is the nonlinear reaction term.
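To make the physics-informed training objective concrete, the sketch below computes the Burgers residual $u_t + u\,u_x - \nu\,u_{xx}$ with automatic differentiation; the network interface net(xt) on stacked (x, t) columns is an illustrative assumption, not the exact code used in our experiments.

```python
import torch

def burgers_residual(net, xt, nu):
    """PDE residual of Burgers' equation for a network u = net([x, t])."""
    xt = xt.clone().requires_grad_(True)  # columns: x = xt[:, 0], t = xt[:, 1]
    u = net(xt)
    du = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = du[:, 0:1], du[:, 1:2]
    du_x = torch.autograd.grad(u_x, xt, torch.ones_like(u_x), create_graph=True)[0]
    u_xx = du_x[:, 0:1]
    return u_t + u * u_x - nu * u_xx
```

The physics loss is then the mean squared residual over collocation points, added to the initial- and boundary-condition data losses.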
Training setups
PINNs are usually trained using full-batch optimization without mini-batch scheduling, and the MLP is the most common architecture in this field, so we adopt these conventions in our experiments. To navigate the notoriously difficult physics loss landscapes, we use a two-stage optimization procedure. Stage 1 consists of standard Adam iterations to rapidly descend into a favorable loss basin.(Kingma and Ba, 2017) Stage 2 refines the solution using a fixed-budget L-BFGS optimizer.(Nocedal, 1980) The learning rates for standard network weights and the CMG-GLLF parameters were searched independently. The exact iteration budgets and grid-searched learning rates are detailed in Table 9. The training configurations follow the DeepXDE PINN tutorials for each task.(Lu et al., 2021) For each experiment, we run 5 seeds and average the evaluation metrics for robust results.
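A minimal sketch of this two-stage procedure, assuming a closure loss_fn that assembles the full PINN loss (the iteration counts and learning rate are placeholders drawn from the search space in Table 9):

```python
import torch

def train_two_stage(model, loss_fn, adam_iters, lbfgs_iters, lr=1e-3):
    """Stage 1: Adam descent into a good basin. Stage 2: L-BFGS refinement."""
    adam = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(adam_iters):
        adam.zero_grad()
        loss = loss_fn()
        loss.backward()
        adam.step()
    lbfgs = torch.optim.LBFGS(model.parameters(), max_iter=lbfgs_iters)

    def closure():
        lbfgs.zero_grad()
        loss = loss_fn()
        loss.backward()
        return loss

    lbfgs.step(closure)
```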
The code for all experiments in this manuscript is available at https://github.com/biomedical-cybernetics/CMG-GLLF.git

References

  1. Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer Normalization. [CrossRef]
  2. Burden, R. L., Faires, J. D., and Burden, A. M. (2016). Numerical analysis., Tenth edition. Boston, MA: Cengage Learning.
  3. Cuomo, S., Di Cola, V. S., Giampaolo, F., Rozza, G., Raissi, M., and Piccialli, F. (2022). Scientific Machine Learning Through Physics–Informed Neural Networks: Where we are and What’s Next. J Sci Comput 92, 88. [CrossRef]
  4. Fan, W., and Chen, X. (2026). Embedding Physics into Machine Learning: A Review of Physics Informed Neural Networks as Partial Differential Equation Forward Solvers. Tsinghua Science and Technology 31, 1326–1364. [CrossRef]
  5. Gu, W., Zhang, Y., Muscoloni, A., and Cannistraci, C. V. (2025). A Generalized Logistic-Logit Function and Its Application to Multi-Layer Perceptron and Neuron Segmentation. [CrossRef]
  6. Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. [CrossRef]
  7. Papakonstantinou, J. M., and Tapia, R. A. (2013). Origin and Evolution of the Secant Method in One Dimension. The American Mathematical Monthly 120, 500. [CrossRef]
  8. Jordan, K. (2024). Muon: An optimizer for hidden layers in neural networks. Available at: https://kellerjordan.github.io/posts/muon/ (Accessed April 24, 2026).
  9. Kelley, C. T. (2003). Solving Nonlinear Equations with Newton’s Method. Society for Industrial and Applied Mathematics. [CrossRef]
  10. Kingma, D. P., and Ba, J. (2017). Adam: A Method for Stochastic Optimization. [CrossRef]
  11. Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images.
  12. Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324. [CrossRef]
  13. Lu, L., Meng, X., Mao, Z., and Karniadakis, G. E. (2021). DeepXDE: A Deep Learning Library for Solving Differential Equations. SIAM Rev. 63, 208–228. [CrossRef]
  14. Nair, V., and Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines.
  15. Nocedal, J. (1980). Updating quasi-Newton matrices with limited storage. Math. Comp. 35, 773–782. [CrossRef]
  16. Perlman, I., and Normann, R. A. (1998). Light adaptation and sensitivity controlling mechanisms in vertebrate photoreceptors. Progress in Retinal and Eye Research 17, 523–563. [CrossRef]
  17. Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, 686–707. [CrossRef]
  18. Sato, S., and Kefalov, V. J. (2025). Characterization of zebrafish rod and cone photoresponses. Sci Rep 15, 13413. [CrossRef]
  19. Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. [CrossRef]
  20. Wesselink, W., Grooten, B., Xiao, Q., Campos, C. de, and Pechenizkiy, M. (2024). Nerva: a Truly Sparse Implementation of Neural Networks. [CrossRef]
  21. Zhao, C., Zhang, F., Lou, W., Wang, X., and Yang, J. (2024). A comprehensive review of advances in physics-informed neural networks and their applications in complex fluid dynamics. Physics of Fluids 36, 101301. [CrossRef]
Figure 1. CMG-Explicit curve. The figure shows how the inflection rate µ and the deviate inflection point I affect the shape of the CMG curve. When µ = 0, CMG is a step function with the discontinuity at the inflection point; when 0 < µ < 0.5, it is a generalized logistic function, and lower µ yields a steeper curve; when µ = 0.5, CMG becomes a linear function; when 0.5 < µ < 1, it is a generalized logit function (inverse logistic function), and the higher µ is, the more gradual the curve becomes; when µ = 1, the function remains constant at the inflection value. I determines the asymmetry of the curve, specifically at what proportion of the x range the deviate inflection point occurs. The other parameters are set to x_min = y_L = 0, x_max = y_R = 1.
Figure 2. Comparison of test accuracy between CMG-Explicit input feature modulator and other methods on MLP. Left panel for CIFAR10 results, right panel for CIFAR100.
Figure 3. Test accuracy comparison between CMG-Implicit, Vanilla CNN, Linear and CMG-Explicit on Simple CNN.
Figure 5. CMG Parameter Distributions as Hidden Layer Activation Functions on (A) CIFAR10 and (B) CIFAR100.
Figure 6. Neuron-wise CMG parameter distribution on the PINN solving Burgers' equation.
Table 1. Comparison of computational overhead between CMG-Explicit and other methods on MLP. Peak GPU memory in MB (GPU Mem) and estimated training time in hours (Time) are recorded on CIFAR10 and CIFAR100.

Method          CIFAR10                   CIFAR100
                GPU Mem (MB)   Time (h)   GPU Mem (MB)   Time (h)
Vanilla MLP     135            0.61       226            0.64
Linear          135            0.65       226            0.68
CMG-Implicit    1608           1.04       1645           1.03
CMG-Explicit    135            0.72       226            0.73
Table 2. Comparison of numerical stability between CMG-Explicit and other methods on MLP. The occurrences of non-finite loss (NF Loss), non-finite gradients (NF Grads), and non-finite parameters (NF Params) during training on CIFAR10 and CIFAR100 are recorded.

Method          CIFAR10                          CIFAR100
                NF Loss   NF Grads   NF Params   NF Loss   NF Grads   NF Params
Vanilla MLP     0         0          0           0         0          0
Linear          0         0          0           0         0          0
CMG-Implicit    0         0          0           0         0          0
CMG-Explicit    0         0          0           0         0          0
Table 3. Comparison of numerical stability between CMG-Explicit and other methods on Simple CNN.

Method          CIFAR-10                         CIFAR-100
                NF Loss   NF Grads   NF Params   NF Loss   NF Grads   NF Params
Vanilla CNN     0         0          0           0         0          0
Linear          0         0          0           0         0          0
CMG-Implicit    0         27817      27817       0         78833      78833
CMG-Explicit    0         0          0           0         0          0
Table 4. Comparison of computational overhead between CMG-Explicit and other methods on Simple CNN.

Method          CIFAR-10                 CIFAR-100
                Mem (MB)   Time (h)      Mem (MB)   Time (h)
Vanilla CNN     320        0.89          320        0.93
Linear          321        0.96          321        0.95
CMG-Implicit    1581       1.27          1581       1.33
CMG-Explicit    328        1.05          328        1.01
Table 5. Normalization ablation on CIFAR. Comparison of test accuracy (%) between ReLU and CMG-Explicit as hidden-layer activation functions across normalization methods on the CIFAR10 and CIFAR100 datasets.

Normalization   CIFAR10                   CIFAR100
                ReLU     CMG-Explicit     ReLU     CMG-Explicit
None            67.65    67.99            41.49    41.43
Tanh-wrapper    65.85    65.74            40.72    40.82
LayerNorm       69.66    69.40            43.62    43.46
BatchNorm       69.98    69.59            45.86    45.40
Table 6. Relative L2 error (lower is better) of Tanh and CMG-Explicit as activation functions under different normalization methods on Burgers' equation.

Normalization   Relative L2 Error
                Tanh        CMG-Explicit
None            2.10e-02    2.35e-01
Tanh-wrapper    N/A         6.34e-03
LayerNorm       6.70e-01    1.11e+00
BatchNorm       5.32e-01    7.01e-01
Table 7. Relative L2 error of Tanh and CMG-Explicit as activation functions on different tasks.

Task                 Activation Strategy   Relative L2 Error
Burgers              Tanh                  2.10e-02
                     CMG Neuron-wise       6.34e-03
                     CMG Layer-wise        8.83e-03
Allen-Cahn           Tanh                  1.68e-02
                     CMG Neuron-wise       1.51e-02
                     CMG Layer-wise        1.36e-02
Diffusion-Reaction   Tanh                  4.33e-03
                     CMG Neuron-wise       5.36e-03
                     CMG Layer-wise        3.62e-03
Table 8. Training configurations.

Batch sizes              32, 64, 128
Initial learning rates   0.025, 0.01, 0.001
Seeds                    0, 1, 2
Momentum                 0.95
Weight decay             0.01
Epochs                   CIFAR10/100 & Simple CNN: 150; VGG-16: 300
Table 9. Hyperparameter configurations for physics-informed neural networks (PINNs).

Task                 Network structure          Adam iters   L-BFGS iters   Learning rates
Burgers              [2,20,20,20,1]             15,000       15,000         2e-3, 1e-3, 5e-4
Allen-Cahn           [2,20,20,20,1]             40,000       15,000         2e-3, 1e-3, 5e-4
Diffusion-Reaction   [2,30,30,30,30,30,30,1]    20,000       None           2e-3, 1e-3, 5e-4