Solve Bi-Level Optimization Model for Meta-Learning Using Method of Lagrangian Multipliers

Abstract
Optimization-based meta-learning has emerged as a powerful framework for improving model generalization, especially in domains with diverse and heterogeneous data distributions. In this work, we propose a bilevel optimization model for meta-learning, explicitly framed through an optimal control perspective. Our approach formulates the meta-training process as a constrained optimization problem, where the lower-level updates task-specific models using a learnable unrolling network, and the upper-level adjusts hyperparameters to minimize validation losses across tasks. By applying the Method of Lagrangian Multipliers (MLM), we model both the primal reconstruction variables and the dual multipliers, ensuring that updates respect the dynamic constraints of the optimization process. We prove the theoretical equivalence between direct loss minimization and Lagrangian-based optimization and develop an efficient algorithm for network training. Experimental motivations drawn from magnetic resonance imaging (MRI) reconstruction suggest that our framework offers scalable and principled solutions, with potential for broader impact in general inverse problems and meta-learning scenarios.

1. Introduction

Meta-learning, often referred to as "learning to learn," [1,2,3] has emerged as a powerful framework for improving model generalization, particularly in domains where data distributions vary significantly across tasks.
Inspired by these advances [4], our proposed model integrates task correlation learning directly into the optimization procedure, seeking not only to minimize training loss but also to enhance the quality of downstream decision-making through better generalization. In magnetic resonance imaging (MRI) reconstruction, where datasets can differ widely in anatomy, acquisition protocols, and imaging artifacts, meta-learning provides a promising avenue for building adaptable models. Building on the foundation of optimization-based deep learning methods for MRI reconstruction [5,6], we propose a bi-level optimization model for meta-learning that explicitly captures the relationship between training and validation datasets through learnable task correlations.
This work is motivated by a line of recent studies exploring optimization-driven and meta-learning approaches, including an optimization-based meta-learning model for diverse datasets [7,8,9] and a learnable variational model for joint multimodal MRI reconstruction and synthesis [10]. These advances illustrate that incorporating optimization formulations into the learning process enhances model robustness across diverse conditions.
Our framework formulates meta-training as a constrained bi-level optimization problem, where the lower-level problem learns task-specific models weighted by a normalized hyperparameter matrix, and the upper-level problem adjusts these hyperparameters to minimize validation losses. Furthermore, we extend the training dynamics using the method of Lagrangian multipliers (MLM) to establish the theoretical equivalence between direct loss minimization and primal-dual optimization updates, following optimization principles similar to those used in optimal control frameworks for image processing [11,12].
By synthesizing principles from optimization theory, control dynamics [13], and deep learning, our approach offers a principled and scalable training algorithm for meta-learning, potentially improving performance on heterogeneous, real-world imaging datasets. This work continues the trajectory of optimization-based learning [9,12,14].

1.1. Optimization-Based Meta-Learning: Frameworks and Algorithms

In optimization-based meta-learning, the learner is not just learning parameters; it is learning how to optimize.
Instead of using a fixed optimization method (such as SGD or Adam), one trains an optimizer (or regularizer) so that, across many tasks, it adapts quickly and solves new tasks better. This is typically framed as a bi-level optimization problem:
$$\min_{\theta} \ \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[ \mathcal{L}_{\text{meta}}(\theta; \mathcal{T}) \right],$$
where:
  • $\theta$ denotes the parameters of the optimizer, model, or regularizer,
  • $\mathcal{T} \sim p(\mathcal{T})$ represents tasks sampled from a distribution $p(\mathcal{T})$,
  • $\mathcal{L}_{\text{meta}}(\theta; \mathcal{T})$ measures how well the task is solved after a few optimization steps.
At each meta-training iteration:
  • Sample a batch of tasks $\mathcal{T}_i$ from the task distribution.
  • For each task $\mathcal{T}_i$:
    • Initialize model parameters $x$ (e.g., randomly or from a pre-trained model).
    • Inner Loop: Solve the task-specific optimization problem for a few steps using the current optimizer (parameterized by $\theta$), yielding adapted parameters $x_i$.
    • Compute the task loss $\mathcal{L}_{\text{task}}(x_i; \mathcal{T}_i)$.
  • Outer Loop:
    • Aggregate all the task losses to compute a meta-loss.
    • Update $\theta$ (the optimizer or model parameters) via gradient descent to improve performance on future tasks (see the sketch below).
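To make this loop concrete, here is a minimal sketch of a MAML-style instantiation in PyTorch. It is an illustration under stated assumptions rather than a prescribed implementation: `model`, `loss_fn`, and the `tasks` iterable (yielding a training batch and a validation batch per task) are hypothetical placeholders.

```python
import torch
from torch.func import functional_call  # requires PyTorch >= 2.0

def inner_adapt(model, batch, loss_fn, inner_lr=1e-2, steps=5):
    """Inner loop: a few differentiable gradient steps on one task."""
    x, y = batch
    params = dict(model.named_parameters())
    for _ in range(steps):
        loss = loss_fn(functional_call(model, params, (x,)), y)
        # create_graph=True keeps the graph so the meta-gradient
        # can flow back through these inner updates
        grads = torch.autograd.grad(loss, tuple(params.values()),
                                    create_graph=True)
        params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}
    return params

def meta_step(model, tasks, loss_fn, meta_opt, inner_lr=1e-2, steps=5):
    """Outer loop: aggregate post-adaptation losses, then update theta."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for train_batch, val_batch in tasks:  # tasks T_i sampled from p(T)
        adapted = inner_adapt(model, train_batch, loss_fn, inner_lr, steps)
        xv, yv = val_batch
        meta_loss = meta_loss + loss_fn(
            functional_call(model, adapted, (xv,)), yv)
    meta_loss.backward()  # gradient flows through the unrolled inner steps
    meta_opt.step()
```

Here `meta_opt` would be any stochastic optimizer over `model.parameters()` (e.g., Adam), and the meta-loss is exactly the aggregated post-adaptation task loss described above.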

2. Problem Settings

Suppose we are given a set of $N$ training examples, denoted as $\{(x^{(j)}, u^{*(j)})\}_{j=1}^{N}$, where for each index $j \in \{1, \dots, N\}$:
  • $x^{(j)}$ represents the input data (e.g., an undersampled measurement or an input image),
  • $u^{*(j)}$ denotes the corresponding ground-truth output (e.g., a fully sampled or high-quality image).
We aim to train a neural network parameterized by $\Theta$, where the objective is to learn $\Theta$ by minimizing a loss function that measures the discrepancy between the network's output and the ground truth across the training set.
This training procedure can be formulated as a bilevel optimization problem [15], consisting of two levels:
  • Lower-level optimization: For a fixed set of trainable parameters $\Theta$, we update the reconstruction variable $u$ by solving a task-specific optimization problem guided by the network.
  • Upper-level optimization: After updating $u$, we update the network parameters $\Theta$ by minimizing the empirical loss over the training dataset.
Thus, the network training involves alternating between solving for $u$ given $\Theta$, and optimizing $\Theta$ based on the performance measured by the loss function $\ell$:
$$\min_{\Theta} \ \ell(u(T)),$$
$$\text{s.t.} \quad u(t) = g(u(t-1), \theta_t), \quad t = 1, \dots, T,$$
$$u(0) = u_0.$$
Let $U = (u(0), \dots, u(T))$ and $\Theta = (\theta_1, \dots, \theta_T)$ denote the collections of states $u(t)$ and controls $\theta_t$ at all time steps, respectively. Here $g$ is a multi-phase unrolling network inspired by the proximal gradient algorithm, and the output of $g(\cdot) \in \mathbb{C}^{m \times n \times c}$ is the updated multi-coil MRI data from each phase. In our framework, we introduce a neural network $g$ that serves as an intermediate mapping between consecutive states of the reconstruction variable. Specifically, at each iteration $t$ (where $t = 0, \dots, T-1$), the network $g$ takes the current estimate $u(t)$ and produces an updated estimate $u(t+1)$:
$$u(t+1) = g(u(t); \theta^{(t)}),$$
where $\theta^{(t)}$ denotes the set of trainable parameters at iteration $t$.
To initialize the reconstruction process, we define a separate network $g_0$, which operates with an initial set of control parameters $\theta^{(0)}$. The role of $g_0$ is to map the given partial k-space measurements $f$ to an initial image reconstruction $u(0)$:
$$u(0) = g_0(f; \theta^{(0)}).$$
This initial estimate u ( 0 ) then serves as the starting point for the iterative reconstruction process governed by the optimal control system, where g is applied sequentially to refine the reconstruction across T iterations.
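To ground the notation, below is a minimal sketch of such an unrolled network in PyTorch. The residual convolutional `Phase` module, the two-channel real representation of complex data, and the treatment of `f` as an image-domain input to $g_0$ are all illustrative assumptions, not the architecture used in this work.

```python
import torch
import torch.nn as nn

class Phase(nn.Module):
    """One unrolling phase g(.; theta_t): a small residual refinement."""
    def __init__(self, ch=2):  # 2 channels = real/imaginary parts
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, ch, 3, padding=1))

    def forward(self, u):
        return u + self.net(u)  # refine the current estimate

class Unrolled(nn.Module):
    """u(0) = g_0(f; theta^(0));  u(t+1) = g(u(t); theta^(t)), t = 0..T-1."""
    def __init__(self, T=8, ch=2):
        super().__init__()
        self.g0 = Phase(ch)  # initializer g_0 (f assumed image-domain here)
        self.phases = nn.ModuleList(Phase(ch) for _ in range(T))

    def forward(self, f):
        u = self.g0(f)          # initial estimate u(0)
        for g in self.phases:   # T refinement phases
            u = g(u)
        return u                # final reconstruction u(T)
```

A call such as `Unrolled(T=8)(f)` maps an input `f` of shape `(batch, 2, H, W)` to a refined reconstruction through $T$ learned phases.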
Suppose the training data $D^{\mathrm{tr}} = \{D^{\mathrm{tr}}_{\tau_i}\}_{i=1}^{N}$ consist of $N$ batches, and the validation data $D^{\mathrm{val}} = \{D^{\mathrm{val}}_{v_j}\}_{j=1}^{M}$ consist of $M$ batches.
Let $\theta$ be the collection of the network parameters and $\alpha$ be the set of hyperparameters. We optimize the following bi-level model:
$$\hat{\alpha} = \arg\min_{\alpha} \sum_{j=1}^{M} \ell_{v_j}\left(\theta(\alpha); D^{\mathrm{val}}_{v_j}\right), \tag{3a}$$
$$\text{s.t.} \quad \theta(\alpha) = \arg\min_{\bar{\theta}} \sum_{i=1}^{N} \sigma(\alpha)_i \cdot \ell_{\tau_i}\left(\bar{\theta}; D^{\mathrm{tr}}_{\tau_i}\right), \tag{3b}$$
$$\text{where} \quad \sum_i \sigma(\alpha)_i = 1, \quad 0 \leq \sigma(\alpha)_i \leq 1, \tag{3c}$$
where $\ell$ denotes the loss function, which can be taken to be the cross-entropy loss, and the function $\sigma$ normalizes the hyperparameter vector $\alpha$; here we can simply take $\sigma$ to be the softmax function.
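As a quick worked instance of the normalization: with $N = 3$ and $\alpha = (0, 0, \ln 2)$, the softmax gives $(e^0, e^0, e^{\ln 2}) = (1, 1, 2)$, which sums to $4$, so $\sigma(\alpha) = \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{2}\right)$; the weights lie in $[0, 1]$ and sum to $1$, as required by (3c).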
We also propose a generalized model in which $\alpha \in \mathbb{R}^{N \times M}$, and we denote by $\alpha_j \in \mathbb{R}^{N}$ the $j$-th column of $\alpha$. After applying the normalization function $\sigma$ (as in (3c)), the matrix $\sigma(\alpha)$ can be regarded as a weight matrix. Intuitively and ideally, each component $\sigma(\alpha_j)_i$ represents the correlation between $D^{\mathrm{tr}}_{\tau_i}$ and $D^{\mathrm{val}}_{v_j}$.
$$\hat{\alpha} = \arg\min_{\alpha} \sum_{j=1}^{M} \ell_{v_j}\left(\theta(\alpha_j); D^{\mathrm{val}}_{v_j}\right),$$
$$\text{s.t.} \quad \theta(\alpha_j) = \arg\min_{\bar{\theta}} \sum_{i=1}^{N} \sigma(\alpha_j)_i \cdot \ell_{\tau_i}\left(\bar{\theta}; D^{\mathrm{tr}}_{\tau_i}\right),$$
$$\text{where} \quad \sum_i \sigma(\alpha_j)_i = 1, \quad 0 \leq \sigma(\alpha_j)_i \leq 1.$$
Algorithm 1:
1: Input: Initialize hyperparameters $\alpha^0$ and network parameters $\theta^0$.
2: for epoch $p = 1, 2, \dots, P$ do
3:   $\theta^p = \arg\min_{\bar{\theta}} \sum_{i=1}^{N} \sigma(\alpha^{p-1})_i \cdot \ell_{\tau_i}(\bar{\theta}; D^{\mathrm{tr}}_{\tau_i})$  ⊳ Perform ($k$ steps of) gradient updates in SGD on $D^{\mathrm{tr}}_{\tau_i}$.
4:   $\theta_i^p = \arg\min_{\theta} \alpha_i^{p-1} \cdot \ell_{\tau_i}(\theta; D^{\mathrm{tr}}_{\tau_i})$
5:   $\alpha^p = \arg\min_{\alpha} \sum_{j=1}^{M} \ell_{v_j}(\theta^p; D^{\mathrm{val}}_{v_j})$  ⊳ Perform gradient update steps in SGD on $D^{\mathrm{val}}_{v_j}$.
6: end for
7: Output: $\alpha^P$, $\theta^P$.
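The sketch below illustrates one epoch of Algorithm 1 in PyTorch under stated assumptions: `model`, `loss_fn`, the batch lists, and the learning rates are hypothetical placeholders, and the upper-level update of $\alpha$ uses a single differentiable (one-step unrolled) lower-level update so that the validation loss carries a gradient back to $\alpha$. This is a common practical approximation to the exact bilevel gradient, not necessarily the scheme used here.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call  # requires PyTorch >= 2.0

def weighted_train_loss(model, params, alpha, train_batches, loss_fn):
    """Lower-level objective: sum_i sigma(alpha)_i * l_tau_i(theta)."""
    w = F.softmax(alpha, dim=0)  # sigma(alpha): weights on the simplex
    return sum(w[i] * loss_fn(functional_call(model, params, (x,)), y)
               for i, (x, y) in enumerate(train_batches))

def run_epoch(model, alpha, train_batches, val_batches, loss_fn,
              k_steps=5, theta_lr=1e-3, alpha_lr=1e-2):
    # Lower level (line 3): k SGD steps on the sigma(alpha)-weighted loss.
    opt = torch.optim.SGD(model.parameters(), lr=theta_lr)
    for _ in range(k_steps):
        opt.zero_grad()
        w = F.softmax(alpha.detach(), dim=0)
        loss = sum(w[i] * loss_fn(model(x), y)
                   for i, (x, y) in enumerate(train_batches))
        loss.backward()
        opt.step()
    # Upper level (line 5): one differentiable inner step, then a
    # gradient step on alpha using the validation loss.
    params = dict(model.named_parameters())
    inner = weighted_train_loss(model, params, alpha, train_batches, loss_fn)
    grads = torch.autograd.grad(inner, tuple(params.values()),
                                create_graph=True)
    adapted = {n: p - theta_lr * g
               for (n, p), g in zip(params.items(), grads)}
    val = sum(loss_fn(functional_call(model, adapted, (x,)), y)
              for x, y in val_batches)
    (g_alpha,) = torch.autograd.grad(val, alpha)
    with torch.no_grad():
        alpha -= alpha_lr * g_alpha
```

Here `alpha` is a tensor of shape `(N,)` created with `requires_grad=True`; after $P$ epochs, the pair `(alpha, model)` plays the role of $(\alpha^P, \theta^P)$.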

3. Network Training from the View of Method of Lagrangian Multipliers (MLM)

The network parameters to be solved from the control problem in Section 2 are $\Theta = \{\theta_t : t = 1, \dots, T\}$. First, we apply MLM to solve this control problem. The Lagrangian of the control problem is
$$\mathcal{L}(U, \Theta; \Lambda) = \ell(u(T)) + \sum_{t=1}^{T} \langle \lambda_t, u(t) - g(u(t-1), \theta_t) \rangle + \langle \lambda_0, u(0) - u_0 \rangle, \tag{4}$$
where $\Lambda = (\lambda_0, \dots, \lambda_T)$ are the Lagrangian multipliers associated with the constraints.
We want to find a primal solution $(U^*, \Theta^*)$ and a dual solution $\Lambda^*$ of the control problem. If $(U^*, \Theta^*; \Lambda^*)$ minimizes (4), then by the first-order optimality conditions we have
$$\nabla_{U} \mathcal{L}(U^*, \Theta^*; \Lambda^*) = 0, \tag{5a}$$
$$\nabla_{\Theta} \mathcal{L}(U^*, \Theta^*; \Lambda^*) = 0, \tag{5b}$$
$$\nabla_{\Lambda} \mathcal{L}(U^*, \Theta^*; \Lambda^*) = 0. \tag{5c}$$
The algorithm is conducted in the following manner:
  • Fix $\Theta$ and define $(\bar{U}, \bar{\Lambda}) := \arg\min_{U, \Lambda} \mathcal{L}(U, \Theta; \Lambda)$. By the first-order optimality conditions with respect to $\lambda_t$, $t = 0, \dots, T$, we get
    $$\nabla_{\lambda_0} \langle \lambda_0, u(0) - u_0 \rangle = 0 \ \Rightarrow\ \bar{u}(0) = u_0, \tag{6a}$$
    $$\nabla_{\lambda_t} \langle \lambda_t, u(t) - g(u(t-1), \theta_t) \rangle = 0 \ \Rightarrow\ \bar{u}(t) = g(\bar{u}(t-1), \theta_t), \quad t = 1, \dots, T. \tag{6b}$$
    By the first-order optimality conditions with respect to $u(t)$, $t = 0, \dots, T$, we get
    $$\nabla_{u(T)} \left[ \ell(u(T)) + \langle \lambda_T, u(T) \rangle \right] = 0 \ \Rightarrow\ \bar{\lambda}_T = -\nabla_{u(T)} \ell(\bar{u}(T)), \tag{7a}$$
    $$\nabla_{u(t)} \left[ \langle \lambda_t, u(t) \rangle - \langle \lambda_{t+1}, g(u(t), \theta_{t+1}) \rangle \right] = 0 \ \Rightarrow\ \bar{\lambda}_t = \langle \bar{\lambda}_{t+1}, \nabla_{u(t)} g(\bar{u}(t), \theta_{t+1}) \rangle, \quad t = 1, \dots, T-1, \tag{7b}$$
    $$\nabla_{u(0)} \left[ \langle \lambda_0, u(0) \rangle - \langle \lambda_1, g(u(0), \theta_1) \rangle \right] = 0 \ \Rightarrow\ \bar{\lambda}_0 = \langle \bar{\lambda}_1, \nabla_{u(0)} g(\bar{u}(0), \theta_1) \rangle. \tag{7c}$$
    Therefore, for fixed $\Theta$, the optimal solutions $(\bar{U}, \bar{\Lambda})$ satisfy the forward equations (6) and the backward (adjoint) equations (7).
  • For fixed $(U, \Lambda)$, we compute the gradient $\nabla_{\Theta} \mathcal{L}(U, \Theta; \Lambda)$:
    $$\nabla_{\theta_0} \mathcal{L}(U^k, \Theta^k; \Lambda^k) = \left\langle \lambda_0^k, \nabla_{\theta_0} \left( u^k(0) - u_0^k \right) \right\rangle = 0, \quad \text{for } t = 0, \tag{8a}$$
    $$\nabla_{\theta_t} \mathcal{L}(U, \theta_t; \Lambda) = \nabla_{\theta_t} \left[ -\langle \lambda_t, g(u(t-1), \theta_t) \rangle \right] \tag{8b}$$
    $$= -\langle \lambda_t, \nabla_{\theta_t} g(u(t-1), \theta_t) \rangle, \quad \text{for } t = 1, \dots, T. \tag{8c}$$
Recall that the conventional SGD method updates $\Theta^{k+1} = \Theta^k - \frac{\eta}{n} \sum_{i=1}^{n} \nabla_{\Theta} \ell\left( u_i^k(T)(\Theta^k) \right)$, where $u^k(T)$ is interpreted as a function of $\Theta^k = (\theta_1^k, \dots, \theta_T^k)$.
[Algorithm 2]
In the following theorem, we show the equivalence between $\nabla_{\Theta} \mathcal{L}(\bar{U}, \Theta; \bar{\Lambda})$ and $\nabla_{\Theta} \ell(\bar{u}(T)(\Theta))$.
Theorem 1.
$$\nabla_{\Theta} \mathcal{L}(\bar{U}, \Theta; \bar{\Lambda}) = \nabla_{\Theta} \ell(\bar{u}(T)(\Theta)).$$
Proof.
From (6) we have
$$\nabla_{\theta_t} g(\bar{u}(t-1), \theta_t) = \nabla_{\theta_t} \bar{u}(t), \quad t = 1, \dots, T. \tag{9a}$$
From (7b), we have
$$\bar{\lambda}_{T-1} = \langle \bar{\lambda}_T, \nabla_{u(T-1)} g(\bar{u}(T-1), \theta_T) \rangle = -\langle \nabla_{u(T)} \ell(\bar{u}(T)), \nabla_{u(T-1)} \bar{u}(T) \rangle \quad \text{by (7a), (6b)}$$
$$= -\nabla_{u(T-1)} \ell(\bar{u}(T)). \tag{10}$$
Similarly, we can show that (11) holds for $t = 0, \dots, T-1$:
$$\bar{\lambda}_t = -\nabla_{u(t)} \ell(\bar{u}(T)); \tag{11}$$
together with (7a), we have
$$\bar{\lambda}_t = -\nabla_{u(t)} \ell(\bar{u}(T)), \quad t = 0, \dots, T. \tag{12}$$
Hence, (8c) reduces to
$$\nabla_{\theta_t} \mathcal{L}(\bar{U}, \theta_t; \bar{\Lambda}) = -\langle \bar{\lambda}_t, \nabla_{\theta_t} g(\bar{u}(t-1), \theta_t) \rangle = \langle \nabla_{u(t)} \ell(\bar{u}(T)), \nabla_{\theta_t} \bar{u}(t) \rangle \quad \text{by (12), (9a)}$$
$$= \nabla_{\theta_t} \ell(\bar{u}(T)(\theta_t)), \quad t = 1, \dots, T.$$
Thus, we derive $\nabla_{\Theta} \mathcal{L}(\bar{U}, \Theta; \bar{\Lambda}) = \nabla_{\Theta} \ell(\bar{u}(T)(\Theta))$. □
This theorem shows that applying SGD to the loss function $\ell$ is equivalent to performing SGD on $\mathcal{L}$. We want to reach the optimal solutions $(\bar{U}, \bar{\Lambda})$ with fixed $\Theta$, so we proceed to update $\Theta$ in an iterative scheme, replacing the notation $\bar{U}, \bar{\Lambda}$ by $U^k, \Lambda^k$. This training process is summarized in Algorithm 3, where $\Lambda^k = (\lambda_0^k, \lambda_1^k, \dots, \lambda_T^k)$, $\Theta^k = (\theta_1^k, \dots, \theta_T^k)$, and $U^k = (u^k(0), u^k(1), \dots, u^k(T))$ for each iteration $k = 0, 1, \dots, K$. We can then use one step of a stochastic optimization method such as Adam to update $\Theta$, replacing $\nabla_{\Theta} \ell(u^k(T)(\Theta))$ by $\nabla_{\Theta} \mathcal{L}(U^k, \Theta; \Lambda^k)$, with the gradient computed in (8).
[Algorithm 3]
Algorithm 3 trains a meta-learning model by solving a bilevel optimization problem using the Method of Lagrangian Multipliers (MLM).
Instead of naively updating the network parameters by only minimizing the loss, this algorithm:
  • Explicitly tracks the primal variables (the reconstruction variables $u(t)$),
  • And the dual variables (the Lagrange multipliers $\lambda_t$),
to ensure that the updates respect the dynamic constraint at each iteration:
$$u(t+1) = g(u(t), \theta_t),$$
where $g$ denotes the learned update network and $\theta_t$ are the trainable control parameters at iteration $t$.
The forward pass simulates the network behavior over T iterations (like "unrolling" the optimization).
The backward pass computes how the loss at the end depends on every single step (using the chain rule and multipliers). The parameter update ensures that the learned optimizer g not only reduces loss but respects the dynamic constraints (i.e., how u ( t ) evolves over time).
Thus, it mimics optimal control theory where we treat the evolution of the system as a dynamic process, not just static minimization.
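As a concrete illustration of this forward/backward structure, the sketch below renders one MLM parameter update in PyTorch, with the multipliers $\lambda_t$ computed explicitly as vector-Jacobian products; by Theorem 1, this coincides with backpropagation through the unrolled network. It is our own minimal rendering under assumptions: `phases` (the per-phase modules $g(\cdot; \theta_t)$), the initial state `u0`, and `loss_fn` (a closure measuring $\ell(u(T))$ against a fixed ground truth) are hypothetical placeholders, not the paper's implementation.

```python
import torch

def mlm_step(phases, u0, loss_fn, lr=1e-3):
    # Forward pass: u(t) = g(u(t-1); theta_t), storing every state.
    us = [u0.detach()]
    with torch.no_grad():
        for g in phases:
            us.append(g(us[-1]))
    # Terminal condition (7a): lambda_T = -grad_u ell(u(T)).
    uT = us[-1].detach().requires_grad_(True)
    lam = -torch.autograd.grad(loss_fn(uT), uT)[0]
    # Backward pass: lambda_{t-1} = <lambda_t, grad_u g(u(t-1), theta_t)> (7b)
    # and grad_{theta_t} L = -<lambda_t, grad_theta g(u(t-1), theta_t)>  (8c),
    # both obtained as vector-Jacobian products.
    for g, u_prev in zip(reversed(list(phases)), reversed(us[:-1])):
        u_in = u_prev.detach().requires_grad_(True)
        out = g(u_in)  # recompute this phase with gradients enabled
        vjps = torch.autograd.grad(out, [u_in] + list(g.parameters()),
                                   grad_outputs=lam)
        lam_prev, theta_grads = vjps[0], vjps[1:]
        with torch.no_grad():
            for p, gp in zip(g.parameters(), theta_grads):
                # descent step: grad_theta L = -<lambda, dg/dtheta>
                p -= lr * (-gp)
        lam = lam_prev
```

One call to `mlm_step` performs one SGD update on $\mathcal{L}$, which, by the equivalence above, is one SGD update on $\ell(u(T)(\Theta))$.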

4. Discussion

Optimization is fundamental to solving inverse problems, yet traditional methods often require careful tuning and lack flexibility across domains. Meta-learning addresses this by enabling optimizers to learn from data how to solve new tasks adaptively, capturing task-specific structure and accelerating convergence [16,17]. This is especially important in applications such as quantitative MRI reconstruction, where diverse data distributions [18] require robust and adaptable optimization strategies.
Optimization and meta-learning are increasingly intertwined with broader developments in decision-making under uncertainty. Recent works in predict-then-optimize frameworks[19], fairness in resource allocation[20], and integrated estimation-optimization perspectives[21] highlight the importance of aligning learning models with downstream optimization goals[22,23,24].

5. Conclusion

In this work, we proposed a bilevel optimization framework for meta-learning, where the training process is viewed through the lens of optimal control theory. By employing the Method of Lagrangian Multipliers (MLM), we explicitly modeled both the primal reconstruction variables and the dual Lagrange multipliers, enabling principled updates that respect the underlying dynamics of the optimization process[25]. Our approach bridges classical optimization techniques with modern deep learning, offering a scalable and theoretically grounded method for learning adaptive optimization strategies.
This formulation not only improves the training stability but also enhances generalization across diverse tasks and domains, making it particularly suitable for challenging inverse problems such as quantitative MRI reconstruction. The equivalence established between direct loss minimization and Lagrangian-based optimization further validates the correctness and efficiency of our method.
Future work will explore extending this framework to more complex scenarios, such as multi-stage reconstruction, uncertainty-aware modeling, and real-time adaptation in dynamic environments. We believe that optimization-driven meta-learning approaches like ours hold strong potential for advancing learning systems in both medical imaging and broader scientific computing domains.

References

  1. T. M. Hospedales, A. Antoniou, P. Micaelli, and A. J. Storkey, “Meta-learning in neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  2. M. Huisman, J. N. van Rijn, and A. Plaat, “A survey of deep meta-learning,” Artificial Intelligence Review, pp. 1–59, 2021.
  3. C. Finn, A. Rajeswaran, S. Kakade, and S. Levine, “Online meta-learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 1920–1930.
  4. W. Bian, Y. Chen, and X. Ye, “An optimal control framework for joint-channel parallel MRI reconstruction without coil sensitivities,” Magnetic Resonance Imaging, 2022.
  5. D. Kiyasseh, A. Swiston, R. Chen, and A. Chen, “Segmentation of left atrial MR images via self-supervised semi-supervised meta-learning,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference, Strasbourg, France, September 27 – October 1, 2021, Proceedings, Part II. Springer, 2021, pp. 13–24.
  6. W. Bian, A. Jang, and F. Liu, “Multi-task magnetic resonance imaging reconstruction using meta-learning,” Magnetic Resonance Imaging, vol. 116, p. 110278, 2025.
  7. Y. Chen, C.-B. Schönlieb, P. Liò, T. Leiner, P. L. Dragotti, G. Wang, D. Rueckert, D. Firmin, and G. Yang, “AI-based reconstruction for fast MRI—a systematic review and meta-analysis,” Proceedings of the IEEE, vol. 110, no. 2, pp. 224–245, 2022.
  8. W. Bian, Y. Chen, X. Ye, and Q. Zhang, “An optimization-based meta-learning model for MRI reconstruction with diverse dataset,” Journal of Imaging, vol. 7, no. 11, p. 231, 2021.
  9. Q. Liu, Q. Dou, and P.-A. Heng, “Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part II. Springer, 2020, pp. 475–485.
  10. W. Bian, Q. Zhang, X. Ye, and Y. Chen, “A learnable variational model for joint multimodal MRI reconstruction and synthesis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 354–364.
  11. W. Bian and Y. K. Tamilselvam, “A review of optimization-based deep learning models for MRI reconstruction,” AppliedMath, vol. 4, no. 3, pp. 1098–1127, 2024.
  12. A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  13. S. Wang, J. Chen, X. Deng, S. Hutchinson, and F. Dellaert, “Robot calligraphy using pseudospectral optimal control in conjunction with a novel dynamic brush model,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 6696–6703.
  14. W. Bian, “Optimization-based deep learning methods for magnetic resonance imaging reconstruction and synthesis,” Ph.D. dissertation, University of Florida, 2022.
  15. arXiv:2406.02626, 2024.
  16. Z. Ding, P. Li, Q. Yang, S. Li, and Q. Gong, “Regional style and color transfer,” in 2024 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL). IEEE, 2024, pp. 593–597.
  17. W. Bian, A. Jang, L. Zhang, X. Yang, Z. Stewart, and F. Liu, “Diffusion modeling with domain-conditioned prior guidance for accelerated MRI and qMRI reconstruction,” IEEE Transactions on Medical Imaging, 2024.
  18. Z. Ke, S. Zhou, Y. Zhou, C. H. Chang, et al., arXiv preprint arXiv:2501.07033, 2025.
  19. S. Verma, Y. Zhao, S. Shah, N. Boehmer, A. Taneja, and M. Tambe, “Group fairness in predict-then-optimize settings for restless bandits,” in The 40th Conference on Uncertainty in Artificial Intelligence, 2024.
  20. N. Boehmer, Y. Zhao, G. Xiong, P. Rodriguez-Diaz, P. D. C. Cibrian, J. Ngonzi, A. Boatin, and M. Tambe, “Optimizing vital sign monitoring in resource-constrained maternal care: An RL-based restless bandit approach,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 28, 2025, pp. 28843–28849.
  21. A. N. Elmachtoub, H. Lam, H. Zhang, et al., arXiv preprint arXiv:2304.06833, 2023.
  22. W. Bian, P. Li, M. Zheng, C. Wang, A. Li, Y. Li, H. Ni, and Z. Zeng, “A review of electromagnetic elimination methods for low-field portable MRI scanner,” in 2024 5th International Conference on Machine Learning and Computer Application (ICMLCA). IEEE, 2024, pp. 614–618.
  23. Z. Li, S. Qiu, and Z. Ke, “Revolutionizing drug discovery: Integrating spatial transcriptomics with advanced computer vision techniques,” in 1st CVPR Workshop on Computer Vision For Drug Discovery (CVDD): Where are we and What is Beyond?
  24. Z. Ke and Y. Yin, “Tail risk alert based on conditional autoregressive VaR by regression quantiles and machine learning algorithms,” in 2024 5th International Conference on Artificial Intelligence and Computer Engineering (ICAICE). IEEE, 2024, pp. 527–532.
  25. W. Bian, A. Jang, and F. Liu, “Improving quantitative MRI using self-supervised deep learning with model reinforcement: Demonstration for rapid T1 mapping,” Magnetic Resonance in Medicine, vol. 92, no. 1, pp. 98–111, 2024.