Submitted: 13 June 2025
Posted: 16 June 2025
Abstract
Keywords:
1. Introduction
- Investigating and visualizing the difficulty that the loss function of an MLP faces in escaping local minima to reach the global minimum.
- Deriving the explicit scaling law for MLPs through error bounds and conducting experimental validation on regression tasks.
- Overcoming the limitations of neural networks in fitting multi-frequency data, thereby making LReg more versatile for complex applications.
- Applying LReg as a PEFT method for large models (not confined to transformer architectures) to investigate whether it can guide their loss functions toward global minima, thereby further reducing test loss and improving the performance of pre-trained models.
2. Methods
2.1. Preliminary: Function Fitting via MLP
2.2. Optimization: Discrete vs. Continuous Processes
2.3. Lagrange Basis Function
2.4. Mesh-Refinement-Coarsening Loop

| Algorithm 1: Mesh-Refinement-Coarsening Loop |
2.5. Scaling Law and Error-Bound Formula
2.6. LReg Applied to Large Models
3. Results
3.1. Scaling Law and Addressing Regression Challenges
3.2. Comparison with Regressors: Training from Scratch
- : Generated from the distribution , with 1,000 training examples and 200 test examples. The LReg was trained with a learning rate of 0.1.
- : Generated from the distribution , with 1,000 training examples and 200 test examples. The LReg was trained with a learning rate of 0.9.
- : Generated from the distribution, with 7,500 training examples and 1,500 test examples. The LReg was trained with a learning rate of 0.1.
- : Generated from the distribution, with 50,000 training examples and 10,000 test examples. The LReg was trained with a learning rate of 0.9.
3.3. Improving Pre-Trained Model Performance and Continuously Reducing Loss
3.4. Transfer Learning via LReg Connection
| Method | MNLI #Params | MNLI Accuracy | SST-2 #Params | SST-2 Accuracy | MRPC #Params | MRPC Accuracy | QNLI #Params | QNLI Accuracy | QQP #Params | QQP Accuracy | RTE #Params | RTE Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FT* | 125M | 87.6 | 125M | 94.8 | 125M | 90.2 | 125M | 92.8 | 125M | 91.9 | 125M | 78.7 |
| BitFit* | 0.1M | 84.7 | 0.1M | 93.7 | 0.1M | 92.7 | 0.1M | 91.8 | 0.1M | 84.0 | 0.1M | 81.5 |
| LoRA* | 0.295M | | 0.295M | | 0.295M | | 0.295M | | 0.295M | | 0.295M | |
| LReg | 0.259M | | 0.259M | | 0.259M | | 0.259M | | 0.259M | | 0.259M | |

| Method | GPT2 #Params | GPT2 PPL | GPT2 Seq/s | GPT2-Medium #Params | GPT2-Medium PPL | GPT2-Medium Seq/s | GPT2-Large #Params | GPT2-Large PPL | GPT2-Large Seq/s | RoBERTa-Base #Params | RoBERTa-Base PPL | RoBERTa-Base Seq/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FT | 124.4M | 21.4234 | 6.98 | 354.8M | 15.8900 | 3.50 | 774.0M | 13.8468 | 1.58 | 124.7M | 3.6415 | 10.05 |
| LoRA | 0.295M | 21.4188 | 12.90 | 0.393M | 15.8873 | 6.29 | 0.737M | 13.8486 | 1.67 | 0.295M | 3.6425 | 9.85 |
| LReg | 0.203M | 21.3401 | 9.58 | 0.404M | 15.8674 | 6.66 | 0.607M | 13.8443 | 2.52 | 0.303M | 3.6414 | 12.37 |
4. Discussion
5. Conclusions
Abbreviations
| FEM | Finite Element Method |
| LReg | Lagrange Regressor |
| MAE | Mean Absolute Error |
| MLP | Multi-Layer Perceptron |
| MRC | Mesh-Refinement-Coarsening |
| PCA | Principal Component Analysis |
| PEFT | Parameter-Efficient Fine-Tuning |
| SVM | Support Vector Machine |
Appendix A Mesh and Lagrange Basis Function
Appendix A.1. Triangulated Mesh

Appendix A.2. Lagrange Basis Function
Appendix A.3. Proof of Lagrange Basis Expression
Appendix A.4. Validation of Lagrange Basis Functions’ Mathematical Expression
Appendix A.5. Visualization of Lagrange Basis Functions

Appendix B Additional Experiments
| Model | d | n | # Params | Acc@1 (%) | Acc@5 (%) |
|---|---|---|---|---|---|
| MobileNet-V2 | 32 | 8 | 0.298 M (+8.51%) | | |
| ResNet50 | 32 | 8 | 0.324 M (+1.27%) | | |
| ResNeXt50 | 32 | 8 | 0.324 M (+1.30%) | | |
| ViT-B-16 | 8 | 32 | 1.032 M (+1.19%) | | |
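
The percentages in the # Params column appear to express the added LReg parameters relative to each baseline model's size. The short check below is a sketch under that assumption: the baseline sizes (3.5 M, 25.6 M, 25.0 M, 86.6 M) are taken from the Baseline* rows of the ImageNet comparison table in this document, and the helper name `overhead_percent` is hypothetical.

```python
# Baseline sizes in millions of parameters (Baseline* rows of the ImageNet table).
baseline_m = {"MobileNet-V2": 3.5, "ResNet50": 25.6, "ResNeXt50": 25.0, "ViT-B-16": 86.6}
# LReg parameter counts in millions, as listed in the # Params column above.
lreg_m = {"MobileNet-V2": 0.298, "ResNet50": 0.324, "ResNeXt50": 0.324, "ViT-B-16": 1.032}
# Percentages printed next to the LReg counts in the table above.
reported_pct = {"MobileNet-V2": 8.51, "ResNet50": 1.27, "ResNeXt50": 1.30, "ViT-B-16": 1.19}


def overhead_percent(added_m: float, base_m: float) -> float:
    """Added parameters as a percentage of the baseline size (assumed definition)."""
    return 100.0 * added_m / base_m


computed_pct = {m: overhead_percent(lreg_m[m], baseline_m[m]) for m in baseline_m}
for model, pct in computed_pct.items():
    print(f"{model}: computed {pct:.2f}%, reported {reported_pct[model]:.2f}%")
```

Under this reading, all four computed values agree with the reported percentages to two decimal places.
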
| Coeff | (4, 4) | (4, 8) | (4, 16) | (4, 32) | (8, 4) | (8, 8) | (8, 16) | (8, 32) |
|---|---|---|---|---|---|---|---|---|
| # Parameters | 36,868 | 69,636 | 135,172 | 266,244 | 73,736 | 139,272 | 270,344 | 532,488 |
| Acc@1 | 76.226 | 76.226 | 76.242 | 76.222 | 76.238 | 76.256 | 76.268 | 76.276 |
| Acc@5 | 92.952 | 92.960 | 92.948 | 92.972 | 92.964 | 92.954 | 92.966 | 92.952 |
| Speed (img/s) | 405.31 | 411.18 | 390.26 | 418.84 | 382.22 | 394.63 | 409.52 | 404.43 |

| Coeff | (16, 4) | (16, 8) | (16, 16) | (16, 32) | (32, 4) | (32, 8) | (32, 16) | (32, 32) |
|---|---|---|---|---|---|---|---|---|
| # Parameters | 147,472 | 278,544 | 540,688 | 1,064,976 | 294,944 | 557,088 | 1,081,376 | 2,129,952 |
| Acc@1 | 76.272 | 76.244 | 76.248 | 76.284 | 76.298 | 76.272 | 76.248 | 76.294 |
| Acc@5 | 92.940 | 92.946 | 92.968 | 92.958 | 92.956 | 92.958 | 92.968 | 92.932 |
| Speed (img/s) | 393.19 | 394.86 | 379.72 | 380.30 | 393.34 | 414.39 | 379.72 | 368.94 |
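
The sixteen parameter counts in the table above follow a simple pattern. As a sketch, and assuming each coefficient pair is ordered as (d, n), the closed form d · (2048 · n + 1025) reproduces every listed count. This expression is inferred from the numbers alone and is not stated in the source; the function name `fitted_param_count` is hypothetical.

```python
# Parameter counts copied from the table above, keyed by the coefficient pair,
# which is assumed here to be ordered as (d, n).
reported = {
    (4, 4): 36_868, (4, 8): 69_636, (4, 16): 135_172, (4, 32): 266_244,
    (8, 4): 73_736, (8, 8): 139_272, (8, 16): 270_344, (8, 32): 532_488,
    (16, 4): 147_472, (16, 8): 278_544, (16, 16): 540_688, (16, 32): 1_064_976,
    (32, 4): 294_944, (32, 8): 557_088, (32, 16): 1_081_376, (32, 32): 2_129_952,
}


def fitted_param_count(d: int, n: int) -> int:
    """Closed form inferred from the table (an assumption, not the authors' formula)."""
    return d * (2048 * n + 1025)


# Collect any table entry that the inferred expression fails to reproduce.
mismatches = {k: v for k, v in reported.items() if fitted_param_count(*k) != v}
print("entries not reproduced by the fit:", mismatches)
```

The count therefore grows linearly in n for fixed d and is proportional to d, which is consistent with the near-doubling seen along each row of the table.
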
Appendix C Additional Applications
Appendix C.1. Solve PDEs

Appendix C.2. Fitting High-Noise Data

Appendix C.3. Fitting Multi-Frequency Data

Appendix C.4. Fit a Vector-Valued Function






| METHOD | ||||||||
|---|---|---|---|---|---|---|---|---|
| OLS Linear | 0.037 | 0.042 | 0.963 | 0.951 | 0.085 | 0.092 | 0.984 | 0.984 |
| Theil-Sen | -44.7 | -54.4 | 0.958 | 0.946 | -0.41 | -3.81 | 0.982 | 0.982 |
| RANSAC | -1.21 | -1.43 | 0.963 | 0.951 | -27.0 | -27.1 | 0.983 | 0.983 |
| Huber | 0.036 | 0.041 | 0.962 | 0.949 | 0.085 | 0.092 | 0.984 | 0.984 |
| Ridge | 0.031 | 0.038 | 0.963 | 0.951 | 0.055 | 0.061 | 0.984 | 0.984 |
| RidgeCV | 0.037 | 0.042 | 0.963 | 0.951 | 0.085 | 0.092 | 0.984 | 0.984 |
| SGD | 0.009 | 0.01 | 0.962 | 0.95 | 0.005 | 0.004 | 0.983 | 0.983 |
| KRR | 0.0036 | 0.04 | 0.97 | 0.962 | 0.056 | 0.051 | 0.993 | 0.992 |
| SVR | 0.11 | 0.101 | 0.97 | 0.962 | 0.29 | 0.308 | 0.992 | 0.992 |
| Voting | 0.852 | 0.852 | 0.942 | 0.917 | 0.869 | 0.868 | 0.951 | 0.946 |
| LReg | 1.0 | 1.0 | 0.971 | 0.963 | 0.999 | 0.999 | 0.992 | 0.992 |
| Model | Method | # Trainable Parameters | Acc@1 | Acc@5 | Speed (img/s) | Training Time (total) |
|---|---|---|---|---|---|---|
| MobileNet-V2 | Baseline* | 3.5 M | 71.878 | 90.286 | 701.51 | 16h 37m |
| MobileNet-V2 | LReg | 328,720 | 71.934 | 90.268 | 685.17 | 32m 12s |
| ResNet-50 | Baseline* | 25.6 M | 76.130 | 92.862 | 440.18 | 2d 1h 15m |
| ResNet-50 | LReg | 540,688 | 76.274 | 92.932 | 411.50 | 40m 42s |
| ResNeXt-50_32x4d | Baseline* | 25.0 M | 77.618 | 93.698 | 334.72 | 3d 1h 32m |
| ResNeXt-50_32x4d | LReg | 135,172 | 77.650 | 93.672 | 315.33 | 51m 9s |
| ViT-B16 | Baseline* | 86.6 M | 81.072 | 95.318 | 331.02 | 3d 3h 26m |
| ViT-B16 | LReg | 67,076 | 81.082 | 95.316 | 316.31 | 56m 54s |
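
To make the training-time column easier to compare, the sketch below converts each duration to seconds and prints the baseline-to-LReg ratio per model. The helper `parse_duration` is hypothetical, and the ratio is purely descriptive: the source does not state here how the Baseline* times were measured, so it should not be read as a like-for-like speed-up.

```python
import re

# Seconds per unit for the duration format used in the table (e.g. "2d 1h 15m").
UNIT_SECONDS = {"d": 86_400, "h": 3_600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert a string such as '3d 1h 32m' or '51m 9s' to whole seconds."""
    return sum(int(value) * UNIT_SECONDS[unit]
               for value, unit in re.findall(r"(\d+)\s*([dhms])", text))


# (baseline time, LReg time) pairs copied from the table above.
training_times = {
    "MobileNet-V2": ("16h 37m", "32m 12s"),
    "ResNet-50": ("2d 1h 15m", "40m 42s"),
    "ResNeXt-50_32x4d": ("3d 1h 32m", "51m 9s"),
    "ViT-B16": ("3d 3h 26m", "56m 54s"),
}

ratios = {model: parse_duration(base) / parse_duration(lreg)
          for model, (base, lreg) in training_times.items()}
for model, ratio in ratios.items():
    print(f"{model}: baseline / LReg training time = {ratio:.1f}x")
```
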
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).