Preprint · Review · This version is not peer-reviewed

Fine-Tuning Transformers Efficiently: A Survey on LoRA and Its Impact

Submitted: 19 February 2025 · Posted: 20 February 2025


Abstract
The rapid growth of Large Language Models (LLMs) has revolutionized natural language processing (NLP), enabling remarkable advancements in text generation, machine translation, and various downstream applications. However, fine-tuning these models remains computationally expensive due to their vast number of parameters. Low-Rank Adaptation (LoRA) has emerged as a highly efficient parameter-efficient fine-tuning (PEFT) technique that significantly reduces memory and computational costs while maintaining competitive performance. LoRA achieves this by freezing the pre-trained model weights and introducing trainable low-rank matrices into transformer layers, enabling efficient adaptation to new tasks. This survey provides a comprehensive review of LoRA, covering its theoretical foundations, practical implementation, recent advancements, and real-world applications. We explore various hybrid approaches that combine LoRA with other fine-tuning techniques, such as prompt tuning and adapter layers, as well as extensions like dynamic rank selection and quantized LoRA for enhanced efficiency. Additionally, we discuss the application of LoRA beyond traditional NLP tasks, including vision-language models, speech processing, and reinforcement learning. Despite its advantages, LoRA presents challenges such as inference overhead and optimal rank selection, which remain active areas of research. We highlight ongoing efforts to address these limitations and discuss future directions, including automated LoRA optimization, continual learning, and deployment in ultra-large foundation models. As AI models continue to grow in complexity, LoRA stands out as a scalable and cost-effective solution for fine-tuning, making it an essential tool for researchers and practitioners seeking to adapt LLMs efficiently.

I. Introduction

Large Language Models (LLMs) have transformed the field of natural language processing (NLP), demonstrating unprecedented capabilities in a wide range of applications, including machine translation, text summarization, dialogue systems, and code generation [1]. Models such as OpenAI’s GPT series, Google’s T5, Meta’s LLaMA, and various other transformer-based architectures have set new benchmarks across numerous NLP tasks. The success of these models is largely attributed to their massive scale, often consisting of billions or even trillions of parameters, trained on vast amounts of text data [2].

However, this rapid advancement has also introduced significant challenges, particularly concerning the fine-tuning and adaptation of such models for specific downstream tasks. Traditional full fine-tuning approaches require updating all model parameters for each new task, leading to high computational costs, substantial memory usage, and long training times [3]. These requirements make fine-tuning infeasible for many researchers and organizations, especially those with limited access to high-performance computing resources [4]. Moreover, as LLMs continue to grow in size, the challenges associated with model adaptation become even more pronounced. This has led to an increasing demand for parameter-efficient fine-tuning methods that enable effective adaptation without exorbitant computational overhead.

One of the most promising techniques to address these challenges is Low-Rank Adaptation (LoRA). LoRA introduces a novel approach to fine-tuning by injecting trainable low-rank matrices into existing weight matrices of LLMs, thereby significantly reducing the number of trainable parameters while maintaining competitive performance [5]. Instead of modifying the full set of model weights, LoRA applies a low-rank decomposition to weight updates, allowing for efficient adaptation without excessive memory or compute requirements [6]. This technique not only reduces the cost of fine-tuning but also facilitates rapid task adaptation, making it particularly useful in scenarios where multiple specialized models are required.

The key advantages of LoRA lie in its efficiency, modularity, and versatility [7]. By reducing the trainable parameter count by orders of magnitude compared to full fine-tuning, LoRA enables organizations to fine-tune large models using significantly less GPU memory [8]. This makes it possible to adapt LLMs on consumer-grade hardware, democratizing access to powerful AI models [9]. Additionally, since LoRA operates by introducing small modifications to existing layers rather than overwriting entire model weights, it allows for efficient storage and reusability of fine-tuned adaptations. Multiple task-specific adapters can be stored and switched dynamically, further enhancing the model’s flexibility.

LoRA has been successfully applied across various domains, including NLP, computer vision, and speech processing [10]. In NLP, it has been used for domain adaptation, sentiment analysis, personalized AI assistants, and knowledge retrieval [11]. Its ability to fine-tune models efficiently while retaining the benefits of large-scale pretraining has made it a crucial tool in both research and industry settings. Moreover, LoRA has been combined with other techniques, such as Prompt Tuning and Prefix Tuning, to further enhance its effectiveness [12].
Recent studies have explored hybrid approaches, demonstrating how LoRA can be integrated with other parameter-efficient tuning (PET) methods to balance efficiency and performance [13]. Despite its advantages, LoRA is not without limitations [14]. One challenge is the trade-off between adaptation efficiency and expressive power. Since LoRA restricts modifications to a low-rank subspace, it may struggle with certain complex tasks that require extensive parameter updates [15]. Additionally, while LoRA reduces memory requirements during training, inference-time efficiency remains an area of active research [16]. Addressing these limitations requires exploring more sophisticated methods of incorporating LoRA into transformer architectures, such as dynamic rank selection, structured pruning, and adaptive tuning strategies [15,17].

This survey provides a comprehensive and structured review of LoRA in the context of LLMs. We begin by discussing the theoretical foundations of LoRA, including its mathematical formulation and underlying principles. We then examine its practical implementations, comparing it with alternative fine-tuning methods and evaluating its impact across various NLP tasks [18]. Furthermore, we highlight recent advancements in LoRA research, including hybrid approaches and optimizations aimed at enhancing its effectiveness. Finally, we explore open challenges and future research directions, such as improving LoRA’s applicability to different architectures, optimizing its efficiency for real-time applications, and developing better strategies for balancing performance and computational cost [19].

By consolidating the latest research and insights on LoRA, this survey aims to serve as a valuable resource for researchers, engineers, and practitioners seeking to optimize the adaptation of LLMs. As the demand for efficient fine-tuning methods continues to grow, understanding and leveraging techniques like LoRA will be critical in enabling scalable, accessible, and cost-effective deployment of large-scale AI models [20].

II. Background and Related Work

The rapid advancements in Large Language Models (LLMs) have necessitated the development of efficient fine-tuning techniques to adapt these models to various downstream tasks. Traditional fine-tuning, which involves updating all parameters of a pre-trained model, has proven to be computationally expensive and memory-intensive, especially as model sizes continue to grow [21]. This has led to the exploration of parameter-efficient fine-tuning (PEFT) methods, among which Low-Rank Adaptation (LoRA) has gained significant attention. In this section, we provide an overview of foundational concepts related to LLM fine-tuning, discuss alternative PEFT approaches, and highlight the key developments that have led to the adoption of LoRA [22].

A. Fine-Tuning of Large Language Models

Fine-tuning is the process of adapting a pre-trained LLM to a specific task or domain by further training it on task-specific data [23]. The most common approach, known as full fine-tuning, involves updating all model parameters. While effective, this method has several drawbacks:
  • High computational cost: Updating billions of parameters requires extensive GPU memory and processing power [24].
  • Storage inefficiency: Each fine-tuned model requires storing a full copy of the modified weights, making it infeasible to maintain multiple task-specific models [25].
  • Catastrophic forgetting: Adapting a model to one task may degrade its performance on previously learned tasks if not handled carefully.
These challenges have motivated researchers to explore alternative fine-tuning strategies that are more efficient while preserving the benefits of pre-trained LLMs.

B. Parameter-Efficient Fine-Tuning (PEFT) Approaches

Several parameter-efficient fine-tuning techniques have been proposed to reduce the computational and storage burdens associated with full fine-tuning [26]. Some of the most notable approaches include:

1) Adapter Layers

Adapter layers introduce small, task-specific modules into the transformer architecture while keeping the original model parameters frozen [27]. These lightweight layers are trained while the base model remains unchanged, significantly reducing memory usage [28]. Adapter-based methods allow for efficient multi-task learning, as different adapters can be swapped in and out without modifying the core model. However, they require additional forward-pass computations during inference, which can slightly increase latency.
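To make this concrete, the following minimal PyTorch sketch shows a bottleneck adapter of the kind described above; the module name, bottleneck width, and placement are illustrative assumptions rather than a specific published design:
import torch.nn as nn
class Adapter(nn.Module):
    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)    # up-projection
    def forward(self, x):
        # Residual connection keeps the frozen sublayer's output intact
        return x + self.up(self.act(self.down(x)))
Such a module is typically inserted after the attention and feed-forward sublayers of each transformer block while the base weights stay frozen.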

2) Prompt Tuning and Prefix Tuning

Prompt tuning modifies a model’s input rather than its parameters [29]. This approach involves learning a small set of tunable prompt embeddings that guide the model’s responses without altering its internal weights [30]. Prefix tuning extends this idea by prepending trainable continuous vectors to the hidden representations at each transformer layer rather than only to the input [31]. While effective for certain tasks, these methods often require large amounts of data to match the performance of traditional fine-tuning [32].
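A rough sketch of the core idea, with illustrative sizes: a small matrix of continuous prompt embeddings is learned and prepended to the (frozen) token embeddings before they enter the model:
import torch
import torch.nn as nn
n_prompt, d_model = 20, 768  # hypothetical prompt length and hidden size
prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
def prepend_prompt(token_embeds):
    # token_embeds: (batch, seq, d_model); only `prompt` receives gradients
    batch = token_embeds.size(0)
    p = prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([p, token_embeds], dim=1)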

3) BitFit

BitFit is an extremely lightweight fine-tuning method that updates only the bias terms of the model’s parameters while keeping all other weights frozen. This reduces the number of trainable parameters by several orders of magnitude. Although BitFit works well for some classification tasks, it may struggle with complex generative tasks that require deeper model adaptations.
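In code, BitFit amounts to a two-line training setup; `model` here is an assumed pre-trained PyTorch module:
# Freeze everything except bias terms (BitFit)
for name, p in model.named_parameters():
    p.requires_grad = name.endswith("bias")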

4) Low-Rank Adaptation (LoRA)

LoRA introduces trainable low-rank matrices into existing weight matrices of an LLM, allowing for efficient adaptation while keeping most of the model’s parameters frozen [33]. By restricting updates to a low-rank subspace, LoRA drastically reduces the number of trainable parameters, making it one of the most effective PEFT techniques. It maintains the pre-trained model’s knowledge while enabling fast and cost-efficient adaptation [34]. LoRA has been widely adopted due to its balance of efficiency and effectiveness across multiple NLP tasks.

C. Evolution of LoRA and Its Adoption in NLP

The development of LoRA was driven by the need for scalable fine-tuning solutions that mitigate the challenges of large-scale models [35]. LoRA’s effectiveness has been demonstrated in various studies, showing that it can achieve performance comparable to full fine-tuning while significantly reducing computational costs [36]. Recent research has explored ways to enhance LoRA further, including hybrid approaches that combine LoRA with other PEFT methods, adaptive rank selection strategies, and optimizations for better inference efficiency. LoRA has been successfully applied in a variety of domains, including:
  • Natural Language Understanding (NLU): Tasks such as sentiment analysis, named entity recognition, and text classification benefit from LoRA’s ability to fine-tune LLMs efficiently.
  • Text Generation: LoRA has been integrated into large autoregressive models like GPT to improve domain-specific text generation while maintaining fluency and coherence [37].
  • Multimodal Applications: Recent work has extended LoRA to multimodal models, enabling efficient adaptation of vision-language models for tasks such as image captioning and visual question answering.

D. Summary

In this section, we provided an overview of the challenges associated with full fine-tuning of LLMs and introduced various parameter-efficient fine-tuning methods. Among these, LoRA has emerged as a leading approach due to its balance between efficiency and performance [38]. In the following sections, we delve deeper into the mathematical foundations of LoRA, its practical implementations, and recent advancements that have further enhanced its effectiveness in adapting LLMs to diverse tasks [39].

III. Mathematical Foundations of LoRA

Low-Rank Adaptation (LoRA) is grounded in the principle of low-rank matrix approximation, which enables efficient fine-tuning of Large Language Models (LLMs) by reducing the number of trainable parameters [40]. This section presents the mathematical formulation of LoRA, detailing its core principles, theoretical justifications, and how it integrates into transformer architectures [41].

A. Low-Rank Decomposition in Neural Networks

In traditional fine-tuning, the weight matrices of a neural network are fully updated during training [42]. However, LoRA assumes that the weight updates during adaptation reside in a low-rank subspace, allowing for a compact representation [43]. Mathematically, let $W \in \mathbb{R}^{d \times k}$ be a weight matrix in an LLM, where $d$ is the input dimension and $k$ is the output dimension [44]. Instead of updating $W$ directly, LoRA models the weight update as:
$\Delta W = AB,$
where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ are the low-rank matrices, and $r \ll \min(d, k)$ is the rank of the decomposition [45]. The base model parameters remain frozen while only $A$ and $B$ are trained, leading to a significant reduction in the number of trainable parameters [46].

B. Parameter Efficiency and Complexity Reduction

The total number of trainable parameters in standard fine-tuning is $O(dk)$. With LoRA, the number of trainable parameters reduces to:
$O(dr + rk) = O(r(d + k))$ [47].
Since $r$ is much smaller than $d$ and $k$, this leads to a substantial reduction in computational cost. For example, if $r = 8$ in a model with millions of parameters per layer, the storage and training efficiency improve dramatically without significantly impacting performance [48].
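A back-of-the-envelope calculation makes the savings concrete; the layer dimensions below are hypothetical but typical of transformer projection layers:
d, k, r = 4096, 4096, 8
full = d * k            # parameters updated by full fine-tuning
lora = d * r + r * k    # parameters trained by LoRA: A (d x r) plus B (r x k)
print(f"full fine-tuning: {full:,}")       # 16,777,216
print(f"LoRA, r=8:        {lora:,}")       # 65,536
print(f"reduction:        {full // lora}x")  # 256x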

C. Integration with Transformer Architectures

LoRA is typically applied to key layers in transformer architectures, such as the self-attention mechanism [49]. In a standard transformer, the attention mechanism computes the output as:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$
where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ are the query, key, and value projections, respectively. In LoRA, the weight matrices $W_Q$ and $W_V$ are modified as:
$W_Q' = W_Q + A_Q B_Q, \qquad W_V' = W_V + A_V B_V.$
Since the base weights $W_Q$ and $W_V$ remain frozen, LoRA introduces minimal computational overhead while allowing for effective task adaptation [50].
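As a sketch of how these modified projections look in code (treating each projection as $x \mapsto xW$, matching the convention above), the module below wraps frozen query and value weights with trainable low-rank factors; names and sizes are illustrative:
import torch
import torch.nn as nn
class LoRAAttentionProjections(nn.Module):
    def __init__(self, d_model, r=8):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        for p in self.parameters():
            p.requires_grad = False  # base projections stay frozen
        # Low-rank factors; B starts at zero so W' = W at initialization
        self.A_Q = nn.Parameter(torch.randn(d_model, r) * 0.01)
        self.B_Q = nn.Parameter(torch.zeros(r, d_model))
        self.A_V = nn.Parameter(torch.randn(d_model, r) * 0.01)
        self.B_V = nn.Parameter(torch.zeros(r, d_model))
    def forward(self, x):
        q = self.W_Q(x) + x @ self.A_Q @ self.B_Q  # W_Q' = W_Q + A_Q B_Q
        v = self.W_V(x) + x @ self.A_V @ self.B_V  # W_V' = W_V + A_V B_V
        return q, v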

D. Rank Selection and Performance Trade-Offs

The choice of the rank $r$ in LoRA is crucial for balancing efficiency and model expressiveness. A higher rank allows the adaptation process to capture more complex transformations but increases the number of trainable parameters [51]. Empirical studies suggest that even low-rank settings ($r = 4$ or $r = 8$) can achieve performance comparable to full fine-tuning in many NLP tasks [52]. LoRA’s effectiveness can be further understood through singular value decomposition (SVD) [53]. Given a full-rank weight update matrix $\Delta W$, its optimal low-rank approximation is obtained by truncating its singular value decomposition:
$\Delta W = U \Sigma V^{\top} \approx U_r \Sigma_r V_r^{\top}$ [54].
This insight highlights that LoRA captures the most significant directions of variation while discarding less critical components [55].
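The truncation can be reproduced directly with PyTorch’s SVD routine; the update matrix below is randomly generated purely for illustration:
import torch
torch.manual_seed(0)
delta_W = torch.randn(512, 512)  # stand-in for a full-rank update
r = 8
U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
delta_W_r = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]  # U_r Sigma_r V_r^T
# By the Eckart-Young theorem, this is the best rank-r approximation
err = torch.linalg.norm(delta_W - delta_W_r) / torch.linalg.norm(delta_W)
print(f"relative error at r={r}: {err:.3f}")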

E. Comparison with Other Fine-Tuning Methods

LoRA provides several advantages over other fine-tuning approaches:
  • Storage efficiency: Since only the low-rank matrices are stored, multiple task-specific adaptations can be maintained without redundant full model copies.
  • Reduced computational cost: Training requires fewer parameters to be updated, leading to faster convergence and lower memory consumption [56].
  • Preservation of pre-trained knowledge: By keeping the original model weights frozen, LoRA avoids catastrophic forgetting and enables easy model reversibility [57].

F. Summary

LoRA leverages low-rank matrix decomposition to achieve efficient fine-tuning of LLMs while maintaining competitive performance [58]. Its ability to integrate seamlessly with transformer-based architectures, combined with its storage and computational benefits, makes it a powerful tool for scalable adaptation of large models [59]. In the next section, we will explore practical implementations of LoRA, discussing real-world applications, optimization techniques, and empirical performance evaluations [60].

IV. Practical Implementation of LoRA

While the mathematical foundations of Low-Rank Adaptation (LoRA) provide an efficient framework for fine-tuning Large Language Models (LLMs), its practical implementation requires careful integration into training pipelines, optimization strategies, and real-world deployment scenarios [61]. In this section, we discuss the practical aspects of implementing LoRA, including its integration with existing deep learning frameworks, training strategies, evaluation methodologies, and applications in various domains.

A. Integrating LoRA into Deep Learning Frameworks

LoRA has been widely adopted in popular deep learning libraries, making it accessible for researchers and practitioners. Several frameworks provide built-in support for LoRA, including:
  • Hugging Face Transformers: The Hugging Face library provides APIs to integrate LoRA with models such as GPT, BERT, and T5, enabling efficient fine-tuning with minimal modifications [62].
  • PyTorch LoRA Implementations: Several PyTorch-based implementations, such as peft (Parameter Efficient Fine-Tuning), provide easy-to-use modules for applying LoRA to transformer layers.
  • TensorFlow and JAX Support: Although less common, LoRA implementations exist for TensorFlow and JAX, allowing for efficient adaptation of LLMs within these ecosystems.
To implement LoRA in practice, developers typically modify transformer layers by introducing low-rank matrices into key projection layers [63]. A simple PyTorch implementation involves replacing standard linear layers with LoRA-adapted layers:
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        # Frozen pre-trained projection; only A and B are trained
        self.base_layer = nn.Linear(in_features, out_features, bias=False)
        self.base_layer.weight.requires_grad = False
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        self.B.weight.data.zero_()  # B starts at zero, so the initial update is zero
    def forward(self, x):
        # W'x = Wx + B(Ax): base output plus the low-rank correction
        return self.base_layer(x) + self.B(self.A(x))
This implementation showcases how LoRA modifies a standard linear layer while keeping the original weight matrix frozen [64].
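A brief usage sketch (assuming the class above) confirms that only the low-rank factors remain trainable:
layer = LoRALinear(in_features=768, out_features=768, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")  # 12,288 of 602,112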

B. Training Strategies for LoRA

Training an LLM with LoRA requires optimizing the low-rank matrices while keeping the original model weights unchanged [65]. The following strategies enhance the effectiveness of LoRA training:

1) Optimizing the Learning Rate

Since LoRA significantly reduces the number of trainable parameters, standard learning rates used for full fine-tuning may not be optimal [66]. Lower learning rates often lead to more stable convergence, while adaptive learning rate schedules (e.g., cosine annealing or warm-up schedules) improve fine-tuning efficiency [67].
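A minimal setup along these lines, assuming a `model` whose frozen parameters already have requires_grad=False and an assumed step budget `num_steps`:
import torch
lora_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=2e-4, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
# call scheduler.step() once per optimization step during training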

2) Gradient Accumulation and Mixed Precision Training

To further optimize training, LoRA can be combined with the following techniques (a combined sketch appears after the list):
  • Gradient accumulation: Reduces memory usage by updating gradients over multiple mini-batches [68].
  • Mixed precision training: Uses lower precision (e.g., FP16 or BF16) for faster computation and reduced memory consumption [69].
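The sketch below combines both techniques in a single FP16 training loop; `model`, `loader`, `optimizer`, and `loss_fn` are assumed to exist, and BF16 setups typically omit the gradient scaler:
import torch
accum_steps = 4
scaler = torch.cuda.amp.GradScaler()
for step, (inputs, targets) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets) / accum_steps  # average over accumulated batches
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)  # unscale gradients, then update LoRA factors
        scaler.update()
        optimizer.zero_grad(set_to_none=True)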

3) Task-Specific Adaptation

LoRA is highly effective for domain adaptation and task specialization. Instead of training a separate model for each task, LoRA allows multiple task-specific adapters to be stored and swapped dynamically [70]. For example, in multi-task learning scenarios, different low-rank matrices can be loaded on demand without requiring full retraining [71].
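One minimal route to such switching uses the peft library; the base model name and adapter directories below are placeholders:
from transformers import AutoModelForCausalLM
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("base-model-name")
model = PeftModel.from_pretrained(base, "adapters/task_a", adapter_name="task_a")
model.load_adapter("adapters/task_b", adapter_name="task_b")
model.set_adapter("task_a")  # route requests through task A's low-rank updates
# ... serve task A ...
model.set_adapter("task_b")  # switch tasks without reloading the base model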

C. Evaluation and Benchmarking

Assessing the effectiveness of LoRA requires rigorous benchmarking against other fine-tuning methods [72]. Common evaluation metrics include:
  • Perplexity (PPL): Measures how well the fine-tuned model predicts test data, commonly used in language modeling tasks [73] (a one-line sketch follows this list).
  • Accuracy and F1-score: Standard metrics for classification tasks, such as sentiment analysis or named entity recognition [74].
  • BLEU and ROUGE scores: Used for text generation and summarization tasks to evaluate output quality [75].
  • Computational efficiency: GPU memory usage, training speed, and inference latency are key factors in evaluating LoRA’s efficiency [76].
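For reference, the perplexity sketch promised above: perplexity is the exponential of the mean per-token negative log-likelihood (cross-entropy in nats), where `token_nlls` is an assumed list of such values:
import math
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))
print(perplexity([2.1, 1.7, 2.4]))  # ~7.9 for this made-up example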

D. Real-World Applications of LoRA

LoRA has been successfully applied in various domains, demonstrating its versatility and efficiency [77]. Notable applications include:

1) Natural Language Processing (NLP)

  • Chatbots and Virtual Assistants: LoRA enables fast adaptation of conversational AI models to specific industries (e.g., healthcare, customer service) [78].
  • Machine Translation: By fine-tuning pre-trained models like mBART, LoRA improves translation quality without excessive computational costs [79].
  • Legal and Financial Text Processing: LoRA has been used to adapt LLMs for specialized jargon-heavy domains, such as legal document summarization [80].

2) Computer Vision

Recent research has extended LoRA to vision-language models (e.g., CLIP, BLIP), allowing efficient fine-tuning of models for image captioning and multimodal tasks [81].

3) Biomedical and Healthcare Applications

  • Medical Text Analysis: LoRA has been used to fine-tune BERT-based models for tasks such as clinical report generation and medical coding [82].
  • Drug Discovery: AI-driven molecular property prediction models benefit from LoRA’s efficiency in adapting transformer-based architectures [83].

4) Code Generation and Programming Assistance

LoRA has been applied to fine-tune models like CodeT5 and StarCoder, enhancing their ability to generate code, provide bug fixes, and assist developers in specialized programming languages [84].

E. Challenges and Best Practices

While LoRA offers significant advantages, its implementation is not without challenges [85]. Key considerations include:
  • Rank Selection: Choosing an appropriate rank r is crucial for maintaining a balance between efficiency and expressiveness.
  • Memory Efficiency: While LoRA reduces training costs, inference efficiency remains an area of active research [86].
  • Hybrid Fine-Tuning Approaches: Combining LoRA with other techniques, such as prompt tuning and adapter layers, can further improve performance [87].

F. Summary

LoRA has emerged as a powerful and practical approach for fine-tuning LLMs efficiently. Its seamless integration into popular deep learning frameworks, coupled with its reduced computational footprint, makes it an ideal solution for a wide range of applications [88]. The next section explores recent advancements and ongoing research aimed at further enhancing LoRA’s effectiveness and expanding its use cases [89].

V. Recent Advancements and Ongoing Research

The success of Low-Rank Adaptation (LoRA) has sparked extensive research into further improving its efficiency, applicability, and performance [90]. While LoRA has already demonstrated significant advantages in fine-tuning large language models (LLMs), ongoing studies continue to explore new optimizations, hybrid approaches, and theoretical enhancements [91]. This section discusses recent advancements in LoRA-based fine-tuning, including extensions of LoRA, its combination with other parameter-efficient tuning (PET) methods, and novel applications beyond traditional NLP tasks [92].

A. Hybrid Approaches: Combining LoRA with Other Fine-Tuning Techniques

While LoRA significantly reduces the number of trainable parameters, researchers have explored hybrid approaches that combine LoRA with other fine-tuning techniques to maximize efficiency and flexibility [93]. Some notable hybrid methods include:

1) LoRA + Prompt Tuning

Prompt tuning involves learning small continuous embeddings that modify the model’s input rather than its parameters [94]. Recent work has combined LoRA with prompt tuning to further reduce the adaptation footprint while maintaining competitive performance [95]. This approach is particularly useful in scenarios where fast task switching is required, such as multi-domain chatbots.

2) LoRA + Prefix Tuning

Prefix tuning extends prompt tuning by introducing learnable embeddings into the model’s intermediate representations rather than just the input layer [96]. When used alongside LoRA, this method allows for a balance between expressiveness and computational efficiency, leading to improved results in generative tasks such as machine translation and text summarization.

3) LoRA + Adapter Layers

Adapter layers are small trainable modules inserted within transformer blocks [97]. By combining LoRA with adapter layers, researchers have achieved enhanced model adaptability with minimal memory overhead. This hybrid technique is particularly useful in multilingual NLP, where different adapters can be used for different languages while LoRA fine-tunes shared knowledge [98].

B. Adaptive Rank Selection and Dynamic LoRA

The effectiveness of LoRA is closely tied to the choice of rank r. Traditionally, LoRA uses a fixed rank across all model layers [99]. However, recent research has introduced adaptive rank selection, which dynamically adjusts the rank based on layer importance and task complexity. Some key developments in this area include:
  • Layer-wise Rank Allocation: Instead of assigning a uniform rank to all transformer layers, models can be optimized by using higher ranks in critical layers (e.g., deeper attention layers) and lower ranks in less important layers [100], as sketched after this list.
  • Task-Specific Rank Optimization: Algorithms such as evolutionary search or reinforcement learning can be employed to find optimal rank configurations for different tasks [101].
  • Sparse LoRA: Some studies propose sparsifying the low-rank matrices to further reduce computational requirements while preserving model accuracy [102].
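As a toy illustration of layer-wise allocation, the snippet below reuses the LoRALinear class from Section IV and assigns higher ranks to deeper layers under a simple depth-based heuristic; in practice the allocation would come from importance scores or a search procedure:
num_layers, d_model = 12, 768  # hypothetical model shape
ranks = [4 if i < num_layers // 2 else 16 for i in range(num_layers)]
lora_layers = [LoRALinear(d_model, d_model, rank=r) for r in ranks]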

C. LoRA for Multimodal and Cross-Domain Applications

While LoRA has primarily been used in NLP tasks, recent work has explored its application in multimodal learning and cross-domain adaptation [103]. Notable advancements include:

1) LoRA for Vision-Language Models

Vision-language models (VLMs) such as CLIP, BLIP, and Flamingo have demonstrated strong zero-shot learning capabilities [104]. However, fine-tuning these models for domain-specific tasks remains challenging due to their size. LoRA has been successfully integrated into VLMs to enable efficient adaptation for applications such as:
  • Image captioning with domain-specific knowledge [105].
  • Video understanding for automated content analysis [106].
  • Visual question answering (VQA) in specialized fields like medical imaging [107].

2) LoRA for Speech and Audio Processing

Recent studies have explored LoRA’s effectiveness in fine-tuning speech recognition and audio generation models. By integrating LoRA into transformer-based architectures such as Whisper or Wav2Vec, researchers have achieved low-cost adaptation for tasks like:
  • Domain-specific speech recognition (e.g., medical or legal transcription) [108].
  • Emotion-aware conversational AI [109].
  • Personalized text-to-speech (TTS) systems.

3) LoRA for Reinforcement Learning and Robotics

Beyond NLP and multimodal applications, LoRA has been investigated in reinforcement learning (RL) settings. Recent work has demonstrated that LoRA can be applied to fine-tune policies in transformer-based RL agents, enabling:
  • More efficient policy adaptation in large-scale RL environments [110].
  • Parameter-efficient tuning of foundation models for robotics.
  • Domain-specific adaptation for embodied AI systems [111].

D. Optimizing LoRA for Efficient Inference

While LoRA significantly reduces training costs, its impact on inference efficiency remains an active area of research [112]. Some recent optimizations aimed at improving inference include:

1) Quantized LoRA

Quantization reduces the precision of model weights to lower-bit representations (e.g., INT8 or FP16) to decrease memory usage and speed up inference [113]. Researchers have explored quantized versions of LoRA to make it even more lightweight, particularly for edge AI applications [114].
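A hedged sketch of a QLoRA-style setup, pairing a 4-bit quantized base model with LoRA factors kept in higher precision; the model name, target module names, and hyperparameters are placeholders:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("base-model-name",
                                             quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)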

2) Fusion of LoRA Adapters

In cases where multiple LoRA adapters are trained for different tasks, researchers have explored methods to merge these adapters into a single model without requiring multiple forward passes [115]. This is particularly useful in multi-task learning scenarios where the model must handle diverse inputs efficiently.
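The simplest form of fusion folds a single adapter back into the base weights, so inference costs one matrix multiply per layer; the sketch below assumes the LoRALinear class from Section IV (nn.Linear stores weights as out_features x in_features, so the effective update is B.weight @ A.weight):
import torch
@torch.no_grad()
def merge_lora(layer):
    layer.base_layer.weight += layer.B.weight @ layer.A.weight
    layer.A.weight.zero_()  # neutralize the adapter path after merging
    layer.B.weight.zero_()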

3) LoRA for On-Device AI

Recent work has explored LoRA’s role in enabling on-device fine-tuning of language models for personalized AI assistants. By leveraging LoRA’s low-memory footprint, models can be fine-tuned on consumer hardware such as smartphones and IoT devices [53,116,117].

E. Theoretical Insights into LoRA’s Effectiveness

Several studies have attempted to provide deeper theoretical justifications for why LoRA works so well [118]. Key findings include:
  • LoRA and Model Overparameterization: Research suggests that large language models contain redundant parameters, making them well-suited for low-rank adaptations [119].
  • Information Flow in LoRA-Modified Networks: Studies analyzing LoRA-modified transformers indicate that low-rank updates primarily affect key subspaces responsible for task-specific information [120].
  • Optimization Landscapes with LoRA: Some researchers have analyzed LoRA’s impact on the optimization landscape, showing that it enables more stable convergence compared to full fine-tuning [121].

F. Challenges and Future Directions

Despite its many advantages, LoRA has several limitations that warrant further investigation [122]. Key challenges and potential research directions include:
  • LoRA for Highly Specialized Tasks: While LoRA works well for many tasks, certain applications requiring extensive parameter updates may benefit from hybrid approaches [123].
  • Reducing Inference Overhead: Although LoRA is efficient during training, methods to optimize inference without introducing additional latency remain an open research question [124].
  • Automated LoRA Configuration: Developing algorithms that automatically determine the optimal rank and layer placement for LoRA in different architectures can further enhance its usability [125].
  • Expanding LoRA Beyond Transformers: Most research has focused on transformers, but exploring LoRA’s applicability to other architectures, such as CNNs and RNNs, could broaden its impact.

G. Summary

LoRA continues to evolve as a leading parameter-efficient fine-tuning technique [126]. Recent advancements have expanded its capabilities through hybrid approaches, adaptive rank selection, and novel applications beyond NLP [127]. Ongoing research into inference optimization, theoretical foundations, and cross-domain adaptation will further enhance LoRA’s role in efficient deep learning. In the next section, we conclude with a discussion on LoRA’s long-term impact and potential future developments [128].

VI. Conclusion and Future Perspectives

Low-Rank Adaptation (LoRA) has emerged as a transformative approach to fine-tuning Large Language Models (LLMs) with significantly reduced computational cost and memory requirements [129]. By leveraging low-rank matrix decompositions, LoRA enables efficient adaptation of large-scale pre-trained models without the need to update or store the entire set of parameters [130]. This survey has explored the theoretical foundations, practical implementations, recent advancements, and ongoing research related to LoRA, highlighting its effectiveness across various domains, including natural language processing, vision-language tasks, speech processing, and reinforcement learning.

A. Key Takeaways

The following are the key insights derived from this survey:
  • Parameter Efficiency: LoRA drastically reduces the number of trainable parameters by decomposing weight updates into low-rank matrices, making fine-tuning feasible for large-scale models.
  • Computational and Memory Benefits: By keeping the original model weights frozen, LoRA significantly lowers the GPU memory footprint and accelerates training compared to full fine-tuning [131].
  • Seamless Integration: LoRA has been successfully integrated into widely-used deep learning frameworks such as PyTorch and Hugging Face Transformers, facilitating its adoption by the research and industry communities.
  • Hybrid and Adaptive Techniques: Recent advancements, such as combining LoRA with prompt tuning, adapter layers, and dynamic rank selection, have further improved its flexibility and effectiveness [132].
  • Multimodal and Cross-Domain Applications: LoRA has extended beyond NLP and is now being explored in vision-language models, speech processing, and even reinforcement learning.
  • Inference-Time Considerations: While LoRA optimizes training efficiency, reducing inference overhead remains an important area of research.

B. Future Directions

Despite its impressive benefits, LoRA still presents several open challenges that warrant further exploration [133]. Some promising directions for future research include:

1) Automated LoRA Optimization

Choosing the optimal rank and identifying the most suitable layers for LoRA adaptation remains a manual process in most implementations. Future research could focus on automated methods for rank selection, possibly using reinforcement learning or neural architecture search techniques [134].

2) Reducing Inference Overhead

While LoRA significantly improves training efficiency, it introduces additional computations at inference time due to the added low-rank matrices [135]. Efficient inference techniques, such as matrix fusion or adaptive LoRA integration, could help mitigate this issue [136].

3) Expanding LoRA Beyond Transformers

Most LoRA implementations are tailored for transformer-based architectures [137]. However, extending LoRA to convolutional neural networks (CNNs), recurrent neural networks (RNNs), and graph neural networks (GNNs) could broaden its applicability to new domains such as computer vision and structured data processing [138].

4) Continual Learning and On-Device Adaptation

LoRA’s lightweight nature makes it well-suited for continual learning and edge AI applications. Future research could explore how LoRA can enable personalized AI assistants, federated learning scenarios, and on-device fine-tuning with limited computational resources [139].

5) LoRA for Foundation Models

As foundation models continue to grow in scale, LoRA could play a crucial role in enabling efficient adaptation without requiring massive computational resources [140]. Future work could investigate how LoRA can be optimized for ultra-large models like GPT-4, PaLM, and Gemini.

C. Final Thoughts

LoRA represents a paradigm shift in the fine-tuning of large models, offering a scalable and efficient alternative to traditional full-parameter adaptation. As AI models continue to expand in size and complexity, parameter-efficient techniques like LoRA will become increasingly crucial for democratizing access to powerful language models. With ongoing research addressing its limitations and expanding its applications, LoRA is poised to remain a cornerstone of efficient model adaptation in the AI landscape.

References

  1. Konstantinidis, T.; Iacovides, G.; Xu, M.; Constantinides, T.G.; Mandic, D.P. Finllama: Financial sentiment classification for algorithmic trading applications. arXiv 2024, arXiv:2403.12285. [Google Scholar]
  2. Zhu, Y.; Wichers, N.; Lin, C.; Wang, X.; Chen, T.; Shu, L.; Lu, H.; Liu, C.; Luo, L.; Chen, J.; et al. Sira: Sparse mixture of low rank adaptation. arXiv 2023, arXiv:2311.09179. [Google Scholar]
  3. Chen, T.; Ding, T.; Yadav, B.; Zharkov, I.; Liang, L. Lorashear: Efficient large language model structured pruning and knowledge recovery. arXiv 2023, arXiv:2310.18356. [Google Scholar]
  4. Chen, Y.; Qian, S.; Tang, H.; Lai, X.; Liu, Z.; Han, S.; Jia, J. Longlora: Efficient fine-tuning of long-context large language models. arXiv 2023, arXiv:2309.12307. [Google Scholar]
  5. Zhang, H. Sinklora: Enhanced efficiency and chat capabilities for long-context large language models. arXiv 2023, arXiv:2406.05678. [Google Scholar]
  6. He, J.; Zhou, C.; Ma, X.; Berg-Kirkpatrick, T.; Neubig, G. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations; 2022. [Google Scholar]
  7. Meng, X.; Dai, D.; Luo, W.; Yang, Z.; Wu, S.; Wang, X.; Wang, P.; Dong, Q.; Chen, L.; Sui, Z. Periodiclora: Breaking the low-rank bottleneck in lora optimization. arXiv 2024, arXiv:2402.16141. [Google Scholar]
  8. Wang, H.; Xiao, Z.; Li, Y.; Wang, S.; Chen, G.; Chen, Y. Milora: Harnessing minor singular components for parameter-efficient llm finetuning. arXiv 2024, arXiv:2406.09044. [Google Scholar]
  9. Zhang, F.; Pilanci, M. Riemannian preconditioned lora for fine-tuning foundation models. arXiv 2024, arXiv:2402.02347. [Google Scholar]
  10. Gao, C.; Chen, K.; Rao, J.; Sun, B.; Liu, R.; Peng, D.; Zhang, Y.; Guo, X.; Yang, J.; Subrahmanian, V.S. Higher layers need more lora experts. arXiv 2024, arXiv:2402.08562. [Google Scholar]
  11. Gong, Y.; Zhan, Z.; Jin, Q.; Li, Y.; Idelbayev, Y.; Liu, X.; Zharkov, A.; Aberman, K.; Tulyakov, S.; Wang, Y.; et al. E2gan: Efficient training of efficient gans for image-to-image translation. arXiv 2024, arXiv:2401.06127. [Google Scholar]
  12. Qin, H.; Ma, X.; Zheng, X.; Li, X.; Zhang, Y.; Liu, S.; Luo, J.; Liu, X.; Magno, M. Accurate lora-finetuning quantization of llms via information retention. arXiv 2024, arXiv:2402.05445. [Google Scholar]
  13. Yadav, P.; Choshen, L.; Raffel, C.; Bansal, M. Compeft: Compression for communicating parameter efficient updates via sparsification and quantization. arXiv 2023, arXiv:2311.13171. [Google Scholar]
  14. Asadi, N.; Beitollahi, M.; Khalil, Y.H.; Li, Y.; Zhang, G.; Chen, X. Does combining parameter-efficient modules improve few-shot transfer accuracy? arXiv 2024, arXiv:2402.15414. [Google Scholar]
  15. Zhang, M.; Chen, H.; Shen, C.; Yang, Z.; Ou, L.; Yu, X.; Zhuang, B. Loraprune: Pruning meets low-rank parameter-efficient fine-tuning. arXiv 2023, arXiv:2305.18403. [Google Scholar]
  16. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv 2020, arXiv:2009.03300. [Google Scholar]
  17. Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393. [Google Scholar]
  18. Liu, Z.; Lyn, J.; Zhu, W.; Tian, X.; Graham, Y. Alora: Allocating low-rank adaptation for fine-tuning large language models. arXiv 2024, arXiv:2403.16187. [Google Scholar]
  19. Xu, Y.; Xie, L.; Gu, X.; Chen, X.; Chang, H.; Zhang, H.; Chen, Z.; Zhang, X.; Tian, Q. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv 2023, arXiv:2309.14717. [Google Scholar]
  20. Zhang, Y.; Wang, M.; Wu, Y.; Tiwari, P.; Li, Q.; Wang, B.; Qin, J. Dialoguellm: Context and emotion knowledge-tuned large language models for emotion recognition in conversations. arXiv 2024, arXiv:2310.11374. [Google Scholar]
  21. Wang, H.; Xiang, X.; Fan, Y.; Xue, J. Customizing 360-degree panoramas through text-to-image diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2024; pp. 4933–4943. [Google Scholar]
  22. Zhang, F.; Li, L.; Chen, J.; Jiang, Z.; Wang, B.; Qian, Y. Increlora: Incremental parameter allocation method for parameter-efficient fine-tuning. arXiv 2023, arXiv:2308.12043. [Google Scholar]
  23. Hu, Y.; Xie, Y.; Wang, T.; Chen, M.; Pan, Z. Structure-aware low-rank adaptation for parameter-efficient fine-tuning. Mathematics 2023, 11, 4317. [Google Scholar] [CrossRef]
  24. Ma, Y.; Fan, Y.; Ji, J.; Wang, H.; Sun, X.; Jiang, G.; Shu, A.; Ji, R. X-dreamer: Creating high-quality 3d content by bridging the domain gap between text-to-2d and text-to-3d generation. arXiv 2023, arXiv:2312.00085. [Google Scholar]
  25. He, X.; Li, C.; Zhang, P.; Yang, J.; Wang, X.E. Parameter-efficient model adaptation for vision transformers. In Thirty-Seventh AAAI Conference on Artificial Intelligence; 2023; pp. 817–825. [Google Scholar]
  26. Belofsky, J. Token-level adaptation of lora adapters for downstream task generalization. In 6th Artificial Intelligence and Cloud Computing Conference; 2023; pp. 168–172. [Google Scholar]
  27. Suri, K.; Mishra, P.; Saha, S.; Singh, A. Suryakiran at mediqa-sum 2023: Leveraging lora for clinical dialogue summarization. In Working Notes of the Conference and Labs of the Evaluation Forum; 2023; pp. 1720–1735. [Google Scholar]
  28. Li, S.; Lu, H.; Wu, T.; Yu, M.; Weng, Q.; Chen, X.; Shan, Y.; Yuan, B.; Wang, W. Caraserve: Cpu-assisted and rank-aware lora serving for generative llm inference. arXiv 2024, arXiv:2401.11240. [Google Scholar]
  29. Li, S. Diffstyler: Diffusion-based localized image style transfer. arXiv 2024, arXiv:2403.18461. [Google Scholar]
  30. Miles, R.; Reddy, P.; Elezi, I.; Deng, J. Velora: Memory efficient training using rank-1 sub-token projections. arXiv 2024, arXiv:2405.17991. [Google Scholar]
  31. Pan, R.; Liu, X.; Diao, S.; Pi, R.; Zhang, J.; Han, C.; Zhang, T. LISA: layerwise importance sampling for memory-efficient large language model fine-tuning. arXiv 2024, arXiv:2403.17919. [Google Scholar]
  32. Frank, M.; Wolfe, P.; et al. An algorithm for quadratic programming. Naval research logistics quarterly 1956, 3, 95–110. [Google Scholar] [CrossRef]
  33. Wang, A.; Islam, M.; Xu, M.; Zhang, Y.; Ren, H. SAM meets robotic surgery: An empirical study on generalization, robustness and adaptation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023; pp. 234–244. [Google Scholar]
  34. Gema, A.P.; Daines, L.; Minervini, P.; Alex, B. Parameter-efficient fine-tuning of llama for the clinical domain. arXiv 2023, arXiv:2307.03042. [Google Scholar]
  35. Sui, Y.; Yin, M.; Gong, Y.; Xiao, J.; Phan, H.; Yuan, B. ELRT: efficient low-rank training for compact convolutional neural networks. arXiv 2024, arXiv:2401.10341. [Google Scholar]
  36. Kim, S.; Yang, H.; Kim, Y.; Hong, Y.; Park, E. Hydra: Multi-head low-rank adaptation for parameter efficient fine-tuning. Neural Networks 2024, 106414. [Google Scholar] [CrossRef]
  37. Bhatti, A.; Parmar, S.; Lee, S. SM70: A large language model for medical devices. arXiv 2023, arXiv:2312.06974. [Google Scholar]
  38. Sun, Y.; Li, Z.; Li, Y.; Ding, B. Improving LoRA in privacy-preserving federated learning. arXiv 2024, arXiv:2403.12313. [Google Scholar]
  39. Liu, Y.; An, C.; Qiu, X. Y-tuning: An efficient tuning paradigm for large-scale pre-trained models via label representation learning. Frontiers of Computer Science 2024, 18, 184320. [Google Scholar] [CrossRef]
  40. Li, Y.; Yu, Y.; Liang, C.; He, P.; Karampatziakis, N.; Chen, W.; Zhao, T. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv 2023, arXiv:2310.08659. [Google Scholar]
  41. Smith, J.S.; Cascante-Bonilla, P.; Arbelle, A.; Kim, D.; Panda, R.; Cox, D.D.; Yang, D.; Kira, Z.; Feris, R.; Karlinsky, L. Construct-vl: Data-free continual structured VL concepts learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 14994–15004. [Google Scholar]
  42. Han, A.; Li, J.; Huang, W.; Hong, M.; Takeda, A.; Jawanpuria, P.; Mishra, B. Sltrain: a sparse plus low-rank approach for parameter and memory efficient pretraining. arXiv 2024, arXiv:2406.02214. [Google Scholar]
  43. Ayupov, S.; Chirkova, N. Parameter-efficient finetuning of transformers for source code. arXiv 2022, arXiv:2212.05901. [Google Scholar]
  44. Huang, T.; Zeng, Y.; Zhang, Z.; Xu, W.; Xu, H.; Xu, S.; Lau, R.W.H.; Zuo, W. Dreamcontrol: Control-based text-to-3d generation with 3d self-prior. arXiv 2023, arXiv:2312.06439. [Google Scholar]
  45. Blattmann, A.; Dockhorn, T.; Kulal, S.; Mendelevitch, D.; Kilian, M.; Lorenz, D.; Levi, Y.; English, Z.; Voleti, V.; Letts, A.; et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv 2023, arXiv:2311.15127. [Google Scholar]
  46. Liu, S.; Keung, J.; Yang, Z.; Liu, F.; Zhou, Q.; Liao, Y. Delving into parameter-efficient fine-tuning in code change learning: An empirical study. arXiv 2024, arXiv:2402.06247. [Google Scholar]
  47. Zhang, R.; Qiang, R.; Somayajula, S.A.; Xie, P. Autolora: Automatically tuning matrix ranks in low-rank adaptation based on meta learning. arXiv 2024, arXiv:2403.09113. [Google Scholar]
  48. Bornheim, T.; Grieger, N.; Blaneck, P.G.; Bialonski, S. Speaker attribution in german parliamentary debates with qlora-adapted large language models. arXiv 2024, arXiv:2309.09902. [Google Scholar] [CrossRef]
  49. Yeo, J.H.; Han, S.; Kim, M.; Ro, Y.M. Where visual speech meets language: VSP-LLM framework for efficient and context-aware visual speech processing. arXiv 2024, arXiv:2402.15151. [Google Scholar]
  50. Qiang, R.; Zhang, R.; Xie, P. Bilora: A bi-level optimization framework for overfitting-resilient low-rank adaptation of large pre-trained models. arXiv 2024, arXiv:2403.13037. [Google Scholar]
  51. Gallego-Posada, J.; Ramirez, J.; Erraqabi, A.; Bengio, Y.; Lacoste-Julien, S. Controlled sparsity via constrained optimization or: How I learned to stop tuning penalties and love constraints. In Annual Conference on Neural Information Processing Systems; 2022. [Google Scholar]
  52. Zhai, Y.; Zhang, H.; Lei, Y.; Yu, Y.; Xu, K.; Feng, D.; Ding, B.; Wang, H. Uncertainty-penalized reinforcement learning from human feedback with diverse reward lora ensembles. arXiv 2024, arXiv:2401.00243. [Google Scholar]
  53. Jin, F.; Liu, Y.; Tan, Y. Derivative-free optimization for low-rank adaptation in large language models. arXiv 2024, arXiv:2403.01754. [Google Scholar] [CrossRef]
  54. Jang, U.; Lee, J.D.; Ryu, E.K. Lora training in the NTK regime has no spurious local minima. arXiv 2024, arXiv:2402.11867. [Google Scholar]
  55. Shen, Y.; Xu, Z.; Wang, Q.; Cheng, Y.; Yin, W.; Huang, L. Multimodal instruction tuning with conditional mixture of lora. arXiv 2024, arXiv:2402.15896. [Google Scholar]
  56. Lee, A.N.; Hunter, C.J.; Ruiz, N. Platypus: Quick, cheap, and powerful refinement of llms. arXiv 2023, arXiv:2308.07317. [Google Scholar]
  57. Zhou, H.; Lu, X.; Xu, W.; Zhu, C.; Zhao, T. Lora-drop: Efficient lora parameter pruning based on output evaluation. arXiv 2024, arXiv:2402.07721. [Google Scholar]
  58. Zi, B.; Qi, X.; Wang, L.; Wang, J.; Wong, K.; Zhang, L. Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv 2023, arXiv:2309.02411. [Google Scholar]
  59. Sun, J.; Fu, D.; Hu, Y.; Wang, S.; Rassin, R.; Juan, D.-C.; Alon, D.; Herrmann, C.; van Steenkiste, S.; Krishna, R.; et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. In Synthetic Data for Computer Vision Workshop@ CVPR 2024; 2023. [Google Scholar]
  60. Luo, S.; Tan, Y.; Patil, S.; Gu, D.; von Platen, P.; Passos, A.; Huang, L.; Li, J.; Zhao, H. Lcm-lora: A universal stable-diffusion acceleration module. arXiv 2023, arXiv:2311.05556. [Google Scholar]
  61. Meng, F.; Wang, Z.; Zhang, M. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv 2024, arXiv:2404.02948. [Google Scholar]
  62. Ye, M.; Fang, X.; Du, B.; Yuen, P.C.; Tao, D. Heterogeneous federated learning: State-of-the-art and research challenges. ACM Computing Surveys 2024, 56, 79. [Google Scholar] [CrossRef]
  63. Li, H.; Koto, F.; Wu, M.; Aji, A.F.; Baldwin, T. Bactrian-x: Multilingual replicable instruction-following models with low-rank adaptation. arXiv 2023, arXiv:2305.15011. [Google Scholar]
  64. Valipour, M.; Rezagholizadeh, M.; Kobyzev, I.; Ghodsi, A. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv 2022, arXiv:2210.07558. [Google Scholar]
  65. Sidahmed, H.; Phatale, S.; Hutcheson, A.; Lin, Z.; Chen, Z.; Yu, Z.; Jin, J.; Komarytsia, R.; Ahlheim, C.; Zhu, Y.; et al. PERL: Parameter efficient reinforcement learning from human feedback. arXiv 2024, arXiv:2403.10704. [Google Scholar]
  66. Sun, Y.; Li, M.; Cao, Y.; Wang, K.; Wang, W.; Zeng, X.; Zhao, R. To be or not to be? an exploration of continuously controllable prompt engineering. arXiv 2023, arXiv:2311.09773. [Google Scholar]
  67. Quan, S. Dmoerm: Recipes of mixture-of-experts for effective reward modeling. arXiv 2024, arXiv:2403.01197. [Google Scholar]
  68. Zhang, L.; Zhang, L.; Shi, S.; Chu, X.; Li, B. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv 2023, arXiv:2308.03303. [Google Scholar]
  69. Wang, X.; Aitchison, L.; Rudolph, M. Lora ensembles for large language model fine-tuning. arXiv 2023, arXiv:2310.00035. [Google Scholar]
  70. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT; 2019; pp. 4171–4186. [Google Scholar]
  71. Zhang, Y.; Wang, J.; Yu, L.; Xu, D.; Zhang, X. Personalized lora for human-centered text understanding. In Thirty-Eighth AAAI Conference on Artificial Intelligence; 2024; pp. 19588–19596. [Google Scholar]
  72. Zaken, E.B.; Goldberg, Y.; Ravfogel, S. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 2022; pp. 1–9. [Google Scholar]
  73. Zhu, Y.; Yang, X.; Wu, Y.; Zhang, W. Parameter-efficient fine-tuning with layer pruning on free-text sequence-to-sequence modeling. arXiv 2023, arXiv:2305.08285. [Google Scholar]
  74. Yang, A.X.; Robeyns, M.; Wang, X.; Aitchison, L. Bayesian low-rank adaptation for large language models. arXiv 2023, arXiv:2308.13111. [Google Scholar]
  75. Chen, L.; Ye, Z.; Wu, Y.; Zhuo, D.; Ceze, L.; Krishnamurthy, A. Punica: Multi-tenant lora serving. Proceedings of Machine Learning and Systems; 2024; pp. 1–13. [Google Scholar]
  76. Ding, H.; Gao, J.; Yuan, Y.; Wang, Q. Samlp: A customized segment anything model for license plate detection. arXiv 2024, arXiv:2401.06374. [Google Scholar]
  77. Zeng, Y.; Lee, K. The expressive power of low-rank adaptation. arXiv 2023, arXiv:2310.17513. [Google Scholar]
  78. Khandelwal, A. Infusion: Inject and attention fusion for multi concept zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023; pp. 3017–3026. [Google Scholar]
  79. Louizos, C.; Welling, M.; Kingma, D.P. Learning sparse neural networks through l0 regularization. arXiv 2017, arXiv:1712.01312. [Google Scholar]
  80. Liu, Q.; Wu, X.; Zhao, X.; Zhu, Y.; Xu, D.; Tian, F.; Zheng, Y. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv 2023, arXiv:2310.18339. [Google Scholar]
  81. Yang, S.; Zhou, Y.; Liu, Z.; Loy, C.C. Rerender A video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia 2023 Conference Papers; 2023; pp. 1–11. [Google Scholar]
  82. Liu, S.; Wang, C.; Yin, H.; Molchanov, P.; Wang, Y.F.; Cheng, K.; Chen, M. Dora: Weight-decomposed low-rank adaptation. arXiv 2024, arXiv:2402.09353. [Google Scholar]
  83. Ding, N.; Lv, X.; Wang, Q.; Chen, Y.; Zhou, B.; Liu, Z.; Sun, M. Sparse low-rank adaptation of pre-trained language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023; pp. 4133–4145. [Google Scholar]
  84. Fomenko, V.; Yu, H.; Lee, J.; Hsieh, S.; Chen, W. A note on lora. arXiv 2024, arXiv:2404.05086. [Google Scholar]
  85. Zhang, S.; Chen, Z.; Chen, S.; Shen, Y.; Sun, Z.; Gan, C. Improving reinforcement learning from human feedback with efficient reward model ensemble. arXiv 2024, arXiv:2401.16635. [Google Scholar]
  86. Zhang, J.; Chen, S.; Liu, J.; He, J. Composing parameter-efficient modules with arithmetic operations. arXiv 2023, arXiv:2306.14870. [Google Scholar]
  87. Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.; Chen, W.; et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mac. Intell. 2023, 5, 220–235. [Google Scholar] [CrossRef]
  88. Ren, P.; Shi, C.; Wu, S.; Zhang, M.; Ren, Z.; de Rijke, M.; Chen, Z.; Pei, J. Mini-ensemble low-rank adapters for parameter-efficient fine-tuning. arXiv 2024, arXiv:2402.17263. [Google Scholar]
  89. Feng, W.; Zhu, L.; Yu, L. Cheap lunch for medical image segmentation by fine-tuning SAM on few exemplars. arXiv 2023, arXiv:2308.14133. [Google Scholar]
  90. Yang, H.; Wang, Y.; Xu, X.; Zhang, H.; Bian, Y. Can we trust llms? Mitigate overconfidence bias in llms through knowledge transfer. arXiv 2024, arXiv:2405.16856. [Google Scholar]
  91. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations; 2022. [Google Scholar]
  92. Jiang, W.; Lin, B.; Shi, H.; Zhang, Y.; Li, Z.; Kwok, J.T. Effective and parameter-efficient reusing fine-tuned models. arXiv 2023, arXiv:2310.01886. [Google Scholar]
  93. Ba, K.; Banaei, M.; Aberer, K.; Tabor, J. Lora-xs: Low-rank adaptation with extremely small number of parameters. arXiv 2024, arXiv:2405.17604. [Google Scholar]
  94. Lialin, V.; Muckatira, S.; Shivagunde, N.; Rumshisky, A. Relora: High-rank training through low-rank updates. In The Twelfth International Conference on Learning Representations; 2023. [Google Scholar]
  95. Zhao, J.; Zhang, Z.; Chen, B.; Wang, Z.; Anandkumar, A.; Tian, Y. Galore: Memory-efficient LLM training by gradient low-rank projection. arXiv 2024, arXiv:2403.03507. [Google Scholar]
  96. Yan, Y.; Tang, S.; Shi, Z.; Yang, Q. FeDeRA: Efficient fine-tuning of language models in federated learning leveraging weight decomposition. arXiv 2024, arXiv:2404.18848. [Google Scholar]
  97. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  98. Liu, T.; Low, B.K.H. Goat: Fine-tuned llama outperforms GPT-4 on arithmetic tasks. arXiv 2023, arXiv:2305.14201. [Google Scholar]
  99. Woo, S.; Park, B.; Kim, B.; Jo, M.; Kwon, S.; Jeon, D.; Lee, D. Dropbp: Accelerating fine-tuning of large language models by dropping backward propagation. arXiv 2024, arXiv:2402.17812. [Google Scholar]
  100. Malladi, S.; Wettig, A.; Yu, D.; Chen, D.; Arora, S. A kernel-based view of language model fine-tuning. In International Conference on Machine Learning; 2023; pp. 23610–23641. [Google Scholar]
101. Yu, K.; Liu, J.; Feng, M.; Cui, M.; Xie, X. Boosting3D: High-fidelity image-to-3D by boosting 2D diffusion prior to 3D prior with progressive learning. arXiv 2023, arXiv:2311.13617. [Google Scholar]
102. Chitale, R.; Vaidya, A.; Kane, A.; Ghotkar, A. Task arithmetic with LoRA for continual learning. arXiv 2023, arXiv:2311.02428. [Google Scholar]
103. Yang, J. LongQLoRA: Efficient and effective method to extend context length of large language models. arXiv 2023, arXiv:2311.04879. [Google Scholar]
104. Zhao, Z.; Gan, L.; Wang, G.; Zhou, W.; Yang, H.; Kuang, K.; Wu, F. LoraRetriever: Input-aware LoRA retrieval and composition for mixed tasks in the wild. arXiv 2024, arXiv:2402.09997. [Google Scholar]
  105. Toma, A.; Lawler, P.R.; Ba, J.; Krishnan, R.G.; Rubin, B.B.; Wang, B. Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. arXiv 2023, arXiv:2305.12031. [Google Scholar]
106. Roberson, R.; Kaki, G.; Trivedi, A. Analyzing the effectiveness of large language models on text-to-SQL synthesis. arXiv 2024, arXiv:2401.12379. [Google Scholar]
107. Chen, Z.; Huang, H.; Andrusenko, A.; Hrinchuk, O.; Puvvada, K.C.; Li, J.; Ghosh, S.; Balam, J.; Ginsburg, B. SALM: Speech-augmented language model with in-context learning for speech recognition and translation. arXiv 2023, arXiv:2310.09424. [Google Scholar]
108. Yi, L.; Yu, H.; Wang, G.; Liu, X.; Li, X. pFedLoRA: Model-heterogeneous personalized federated learning with LoRA tuning. arXiv 2023, arXiv:2310.13283. [Google Scholar]
109. Guo, Y.; Yang, C.; Rao, A.; Wang, Y.; Qiao, Y.; Lin, D.; Dai, B. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv 2023, arXiv:2307.04725. [Google Scholar]
  110. Geshkovski, B.; Letrouit, C.; Polyanskiy, Y.; Rigollet, P. The emergence of clusters in self-attention dynamics. In Annual Conference on Neural Information Processing Systems; 2023. [Google Scholar]
  111. Sun, S.; Gupta, D.; Iyyer, M. Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF. arXiv 2023, arXiv:2309.09055. [Google Scholar]
112. Wu, Y.; Xiang, Y.; Huo, S.; Gong, Y.; Liang, P. LoRA-SP: Streamlined partial parameter adaptation for resource-efficient fine-tuning of large language models. In Third International Conference on Algorithms, Microchips, and Network Applications; 2024; pp. 488–496. [Google Scholar]
  113. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Wu, Z.; Chang, B.; Sun, X.; Xu, J.; Li, L.; Sui, Z. A survey for in-context learning. arXiv 2023, arXiv:2301.00234. [Google Scholar]
114. Jeon, H.; Kim, Y.; Kim, J.-J. L4Q: Parameter-efficient quantization-aware training on large language models via LoRA-wise LSQ. arXiv 2024, arXiv:2402.04902. [Google Scholar]
115. Deng, Y.; Wang, R.; Zhang, Y.; Tai, Y.; Tang, C. DragVideo: Interactive drag-style video editing. arXiv 2023, arXiv:2312.02216. [Google Scholar]
116. Chen, Z.; Wang, Z.; Wang, Z.; Liu, H.; Yin, Z.; Liu, S.; Sheng, L.; Ouyang, W.; Qiao, Y.; Shao, J. Octavius: Mitigating task interference in MLLMs via MoE. arXiv 2023, arXiv:2311.02684. [Google Scholar]
  117. Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024. [Google Scholar]
118. Goodfellow, I.J.; Bengio, Y.; Courville, A.C. Deep Learning; Adaptive Computation and Machine Learning series; MIT Press, 2016. [Google Scholar]
119. Hansen, N.; Ostermeier, A. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of the IEEE International Conference on Evolutionary Computation; 1996; pp. 312–317. [Google Scholar]
120. Wang, S.; Chen, L.; Jiang, J.; Xue, B.; Kong, L.; Wu, C. LoRA meets dropout under a unified framework. arXiv 2024, arXiv:2403.00812. [Google Scholar]
  121. Geshkovski, B.; Letrouit, C.; Polyanskiy, Y.; Rigollet, P. A mathematical perspective on transformers. arXiv 2023, arXiv:2312.10794. [Google Scholar]
122. Zhong, M.; Shen, Y.; Wang, S.; Lu, Y.; Jiao, Y.; Ouyang, S.; Yu, D.; Han, J.; Chen, W. Multi-LoRA composition for image generation. arXiv 2024, arXiv:2402.16843. [Google Scholar]
123. Sheng, Y.; Cao, S.; Li, D.; Hooper, C.; Lee, N.; Yang, S.; Chou, C.; Zhu, B.; Zheng, L.; Keutzer, K.; et al. S-LoRA: Serving thousands of concurrent LoRA adapters. arXiv 2023, arXiv:2311.03285. [Google Scholar]
124. Li, J.; Lei, Y.; Bian, Y.; Cheng, D.; Ding, Z.; Jiang, C. RA-CFGPT: Chinese financial assistant with retrieval-augmented large language model. Frontiers of Computer Science 2024, 18, 185350. [Google Scholar] [CrossRef]
  125. Yoo, S.; Kim, K.; Kim, V.G.; Sung, M. As-plausible-as-possible: Plausibility-aware mesh deformation using 2d diffusion priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024; pp. 4315–4324. [Google Scholar]
126. Tan, W.; Zhang, W.; Liu, S.; Zheng, L.; Wang, X.; An, B. True knowledge comes from practice: Aligning LLMs with embodied environments via reinforcement learning. arXiv 2024, arXiv:2401.14151. [Google Scholar]
127. Qi, Z.; Tan, X.; Shi, S.; Qu, C.; Xu, Y.; Qi, Y. PILLOW: Enhancing efficient instruction fine-tuning via prompt matching. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track; 2023; pp. 471–482. [Google Scholar]
128. Gou, Y.; Liu, Z.; Chen, K.; Hong, L.; Xu, H.; Li, A.; Yeung, D.; Kwok, J.T.; Zhang, Y. Mixture of cluster-conditional LoRA experts for vision-language instruction tuning. arXiv 2023, arXiv:2312.12379. [Google Scholar]
  129. Aghajanyan, A.; Gupta, S.; Zettlemoyer, L. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; 2021; pp. 7319–7328. [Google Scholar]
130. Biderman, D.; Ortiz, J.J.G.; Portes, J.; Paul, M.; Greengard, P.; Jennings, C.; King, D.; Havens, S.; Chiley, V.; Frankle, J.; et al. LoRA learns less and forgets less. arXiv 2024, arXiv:2405.09673. [Google Scholar]
  131. Ge, Y.; Ge, Y.; Zeng, Z.; Wang, X.; Shan, Y. Planting a SEED of vision in large language model. arXiv 2023, arXiv:2307.08041. [Google Scholar]
132. Kopiczko, D.J.; Blankevoort, T.; Asano, Y.M. VeRA: Vector-based random matrix adaptation. arXiv 2023, arXiv:2310.11454. [Google Scholar]
133. Salimans, T.; Kingma, D.P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems; 2016; p. 901. [Google Scholar]
  134. Yang, A.X.; Robeyns, M.; Coste, T.; Wang, J.; Bou-Ammar, H.; Aitchison, L. Bayesian reward models for LLM alignment. arXiv 2024, arXiv:2402.13210. [Google Scholar]
135. Sakaguchi, K.; Bras, R.L.; Bhagavatula, C.; Choi, Y. WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM 2021, 64, 99–106. [Google Scholar] [CrossRef]
  136. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
137. Ye, Z.; Lovell, L.; Faramarzi, A.; Ninic, J. SAM-based instance segmentation models for the automation of structural damage detection. arXiv 2024, arXiv:2401.15266. [Google Scholar] [CrossRef]
138. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
139. Shi, J.; Hua, H. Space narrative: Generating images and 3D scenes of Chinese gardens from text using deep learning. In xArch: Creativity in the Age of Digital Reproduction Symposium; 2023; pp. 236–243. [Google Scholar]
140. Liao, B.; Monz, C. ApiQ: Finetuning of 2-bit quantized large language model. arXiv 2024, arXiv:2402.05147. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.