Submitted: 10 February 2025; Posted: 11 February 2025
Abstract
Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art performance in image, audio, and video synthesis. However, their widespread adoption is hindered by high computational costs, slow inference times, and memory-intensive training. In response, numerous techniques have been proposed to enhance the efficiency of diffusion models while maintaining or improving generation quality. This survey provides a comprehensive review of recent advances in efficient diffusion models. We categorize these approaches into four key areas: (1) accelerated sampling methods, which reduce the number of function evaluations required for inference; (2) efficient model architectures, including lightweight U-Net and transformer variants; (3) knowledge distillation and model compression techniques, such as progressive distillation and pruning; and (4) hybrid generative frameworks that integrate diffusion models with alternative paradigms like GANs and VAEs. Additionally, we discuss open challenges, including the trade-offs between sampling speed and quality, memory-efficient training strategies, and real-time deployment on edge devices. We highlight promising research directions, such as adaptive sampling, hardware-aware optimizations, and self-distilling diffusion models. By providing a structured overview of efficiency-focused advancements, we aim to guide future research toward making diffusion models more practical, scalable, and accessible for real-world applications.
Keywords:
1. Introduction
- Accelerated Sampling Techniques: Methods such as improved numerical solvers, stochastic samplers, and adaptive step-size schedules aim to reduce the number of inference steps while maintaining sample quality [9].
- Efficient Model Architectures: Lightweight U-Net and transformer variants that lower the per-step computational cost of the denoising network.
- Knowledge Distillation and Model Compression: Techniques such as teacher-student training, distillation of diffusion processes into simpler models, and progressive denoising help create compact yet effective diffusion models [12]; a minimal sketch of the teacher-student idea follows this list.
- Hybrid and Alternative Frameworks: Combining diffusion models with GANs, VAEs, or implicit models seeks to leverage the strengths of each approach to achieve both efficiency and high-quality generation [13].
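To make the distillation idea above concrete, the sketch below shows the core of a teacher-student training step in the style of progressive distillation: a student network is trained so that one of its denoising steps reproduces two consecutive steps of a frozen teacher. This is a minimal PyTorch-style illustration; the `sampler_step` callable, the timestep indexing, and all other names are assumptions made for exposition, not a specific published implementation.

```python
import torch

def distillation_loss(student, teacher, sampler_step, x_t, t):
    """One progressive-distillation training step (illustrative sketch).

    `student` and `teacher` are noise-prediction networks, and
    `sampler_step(model, x, t_from, t_to)` is a deterministic one-step
    denoising update (e.g., a DDIM step). The student is trained so that
    a single step from t to t-2 matches two consecutive teacher steps.
    """
    with torch.no_grad():
        x_mid = sampler_step(teacher, x_t, t, t - 1)            # teacher: t -> t-1
        x_target = sampler_step(teacher, x_mid, t - 1, t - 2)   # teacher: t-1 -> t-2
    x_pred = sampler_step(student, x_t, t, t - 2)               # student: t -> t-2
    return torch.mean((x_pred - x_target) ** 2)
```

Repeating this procedure over several rounds, halving the number of sampling steps each time, is what allows distilled students to generate in only a handful of steps.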
2. Background
2.1. Forward and Reverse Diffusion Processes
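For reference in this section, the standard formulation of Ho et al. [24] defines a forward process that gradually corrupts a data sample with Gaussian noise and a learned reverse process that removes it. With noise schedule $\beta_t$ and cumulative signal level $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, the two processes can be written as:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, \mathbf{I}\right),

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right).
```

The closed-form marginal $q(x_t \mid x_0)$ allows any noise level to be sampled directly during training, while $p_\theta$ is parameterized by the denoising network whose repeated evaluation dominates inference cost.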
2.2. Training Objective
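In most of the models surveyed here, training reduces to the simplified noise-prediction objective of Ho et al. [24]: a network $\epsilon_\theta$ learns to recover the Gaussian noise that was mixed into a clean sample at a randomly chosen timestep,

```latex
\mathcal{L}_{\mathrm{simple}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right) \right\|^2 \right],
  \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\quad t \sim \mathrm{Uniform}\{1, \dots, T\}.
```

Many of the efficiency techniques in Section 3 keep this objective unchanged and instead reduce either the number of reverse steps taken at inference time or the cost of each evaluation of $\epsilon_\theta$.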
2.3. Key Variants of Diffusion Models
- Denoising Diffusion Probabilistic Models (DDPMs): The original formulation, introduced by Ho et al., demonstrating the capability of diffusion-based generative models [24].
- Denoising Diffusion Implicit Models (DDIMs): A more efficient variant that enables deterministic sampling with fewer steps [25] (a single update step is sketched just after this list).
- Score-Based Generative Models (SGMs): Leveraging score matching and stochastic differential equations to generalize diffusion processes.
- Latent Diffusion Models (LDMs): Reducing computational complexity by applying diffusion in a lower-dimensional latent space instead of pixel space [26].
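Of the variants listed above, DDIMs are referenced most often in the remainder of this survey, so their deterministic single-step update (the $\eta = 0$ case of Song et al. [25]) is sketched below. The function signature and the `alpha_bar` lookup tensor are illustrative assumptions, not a particular library's API.

```python
import torch

def ddim_step(eps_model, x_t, t, t_prev, alpha_bar):
    """Deterministic DDIM update from timestep t to t_prev (eta = 0).

    `eps_model(x, t)` predicts the noise present in x at step t, and
    `alpha_bar` is a 1-D tensor of cumulative products of (1 - beta) over
    the noise schedule. A minimal sketch of the update rule, not a full sampler.
    """
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_model(x_t, t)
    # Estimate the clean sample x0, then re-noise it to the level of t_prev.
    x0_pred = (x_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps
```

Because the update is deterministic, timesteps can be skipped aggressively, which is why DDIM sampling typically needs far fewer network evaluations than ancestral DDPM sampling.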
2.4. Challenges and Computational Bottlenecks
- Slow Inference: Generating a single sample requires multiple forward passes, making real-time applications difficult.
- High Memory Requirements: Large neural networks with deep architectures demand significant GPU resources [27]; common mitigations are sketched after this list.
- Training Complexity: Optimizing diffusion models requires substantial computational power and time, limiting accessibility [28].
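On the training side, two widely used mitigations for the memory and compute costs listed above are activation (gradient) checkpointing and mixed-precision arithmetic. The PyTorch sketch below combines both on a toy stack of residual blocks; `DenoiserBlock`, the tensor shapes, and the synthetic batch are illustrative stand-ins for a real diffusion backbone, not a reference implementation.

```python
import torch
from torch.utils.checkpoint import checkpoint

class DenoiserBlock(torch.nn.Module):
    """Toy stand-in for one block of a diffusion U-Net (illustrative only)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.SiLU(), torch.nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
blocks = torch.nn.ModuleList(DenoiserBlock() for _ in range(8)).to(device)
opt = torch.optim.AdamW(blocks.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(16, 256, device=device)   # toy batch of noisy latents
target = torch.randn_like(x)              # toy regression target (e.g., noise)

opt.zero_grad()
with torch.autocast(device_type=device, dtype=amp_dtype):
    h = x
    for block in blocks:
        # Recompute each block's activations during the backward pass
        # instead of storing them, trading extra compute for memory.
        h = checkpoint(block, h, use_reentrant=False)
    loss = torch.nn.functional.mse_loss(h, target)
scaler.scale(loss).backward()   # scale the loss to avoid fp16 underflow
scaler.step(opt)
scaler.update()
```

Checkpointing trades extra compute for lower activation memory, while mixed precision reduces both memory and, on suitable hardware, wall-clock time; both reappear among the training strategies discussed in Section 4.2.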
3. Efficient Diffusion Models
3.1. Accelerated Sampling Methods
3.1.1. Denoising Diffusion Implicit Models (DDIMs)
3.1.2. Higher-Order Solvers and ODE-Based Methods
3.1.3. Latent Diffusion Models (LDMs)
3.2. Efficient Model Architectures
3.2.1. Lightweight U-Net Architectures
3.2.2. Transformer-Based Diffusion Models
3.3. Knowledge Distillation and Model Compression
3.3.1. Teacher-Student Knowledge Distillation
3.3.2. Progressive Distillation
3.3.3. Quantization and Pruning
3.4. Hybrid Generative Models
3.4.1. Diffusion-GAN Hybrids
3.4.2. Variational Diffusion Models
3.4.3. Score-Based Energy Models
3.5. Summary of Efficiency Improvements
4. Open Challenges and Future Directions
4.1. Balancing Speed and Sample Quality
4.2. Memory-Efficient Training Strategies
4.3. Lightweight Model Architectures
- Neural Architecture Search (NAS) for discovering optimal architectures that balance efficiency and performance [80].
- Sparse and low-rank models to reduce the number of parameters while preserving expressive power [11] (a low-rank factorization sketch follows this list).
- Modular diffusion networks that adapt dynamically based on computational constraints [81].
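As a concrete instance of the low-rank direction noted above, the snippet below factors a dense linear layer into two thin matrices, shrinking the parameter count from roughly out·in to r·(out + in) for a small rank r. The `LowRankLinear` module and the layer sizes are illustrative assumptions rather than a specific published architecture.

```python
import torch

class LowRankLinear(torch.nn.Module):
    """Linear layer factored as two thin matrices (W ~ up @ down).

    Replaces an (out x in) weight with (out x r) and (r x in) factors,
    cutting parameters from out*in to roughly r*(out + in) when r is small.
    Illustrative sketch of the low-rank idea, not a specific method.
    """
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.down = torch.nn.Linear(in_features, rank, bias=False)
        self.up = torch.nn.Linear(rank, out_features, bias=True)

    def forward(self, x):
        return self.up(self.down(x))

dense = torch.nn.Linear(1024, 1024)             # 1,049,600 parameters
low_rank = LowRankLinear(1024, 1024, rank=64)   # 132,096 parameters
print(sum(p.numel() for p in dense.parameters()),
      sum(p.numel() for p in low_rank.parameters()))
```

At rank 64 the factored layer uses roughly one eighth of the dense layer's parameters; in practice the rank is chosen per layer to balance generation quality against the compression target.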
4.4. Fast and Adaptive Sampling Methods
4.5. Efficient Conditional Diffusion Models
4.6. Self-Distillation and Online Learning
4.7. Hybrid Generative Frameworks
4.8. Hardware Optimization and Deployment
4.9. Theoretical Understanding of Efficiency in Diffusion Models
4.10. Summary of Future Directions
5. Conclusion
References
- Dockhorn, T.; Vahdat, A.; Kreis, K. Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. In Proceedings of the International Conference on Learning Representations; 2021.
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. arXiv preprint 2023, arXiv:2310.03744.
- Fan, Y.; Lee, K. Optimizing DDPM Sampling with Shortcut Fine-Tuning. In Proceedings of the International Conference on Machine Learning; 2023; pp. 9623–9639.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint 2023, arXiv:2307.09288.
- Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining; 2006; pp. 535–541.
- Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint 2022, arXiv:2211.01095.
- Kazemzadeh, S.; Ordonez, V.; Matten, M.; Berg, T. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); 2014; pp. 787–798.
- Chen, L.; Zhao, H.; Liu, T.; Bai, S.; Lin, J.; Zhou, C.; Chang, B. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. arXiv preprint 2024, arXiv:2403.06764.
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International conference on machine learning; PMLR, 2023; pp. 19730–19742.
- Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Proceedings of the International Conference on Learning Representations; 2019.
- Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393.
- Fayyaz, M.; Koohpayegani, S.A.; Jafari, F.R.; Sengupta, S.; Joze, H.R.V.; Sommerlade, E.; Pirsiavash, H.; Gall, J. Adaptive token sampling for efficient vision transformers. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 396–414.
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 2022, 23, 1–39.
- Du, D.; Gong, G.; Chu, X. Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey. arXiv preprint 2024, arXiv:2405.00314.
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv preprint 2023, arXiv:2303.08774.
- Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint 2021, arXiv:2111.00396.
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the International Conference on Computer Vision (ICCV); 2021.
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021; pp. 3558–3568.
- Kuznedelev, D.; Kurtić, E.; Frantar, E.; Alistarh, D. CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models. Advances in Neural Information Processing Systems 2024, 36.
- Karras, T.; Aittala, M.; Laine, S.; Aila, T. Elucidating the design space of diffusion-based generative models. In Proceedings of the 36th International Conference on Neural Information Processing Systems; 2022; pp. 26565–26577.
- Hu, Z.; Lan, Y.; Wang, L.; Xu, W.; Lim, E.P.; Lee, R.K.W.; Bing, L.; Poria, S. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. arXiv preprint 2023, arXiv:2304.01933.
- Xu, S.; Li, Y.; Ma, T.; Zeng, B.; Zhang, B.; Gao, P.; Lv, J. TerViT: An efficient ternary vision transformer. arXiv preprint 2022, arXiv:2201.08050.
- Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency Models. In Proceedings of the International Conference on Machine Learning; 2023; pp. 32211–32252.
- Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 2021, 34, 8780–8794.
- Liu, X.; Gong, C.; et al. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In Proceedings of the Eleventh International Conference on Learning Representations; 2022.
- Shang, Y.; Yuan, Z.; Xie, B.; Wu, B.; Yan, Y. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023; pp. 1972–1981.
- Jolicoeur-Martineau, A.; Piché-Taillefer, R.; Mitliagkas, I.; des Combes, R.T. Adversarial score matching and improved sampling for image generation. In Proceedings of the International Conference on Learning Representations; 2021.
- Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Tinyvit: Fast pretraining distillation for small vision transformers. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 68–85.
- Laurençon, H.; Saulnier, L.; Tronchon, L.; Bekman, S.; Singh, A.; Lozhkov, A.; Wang, T.; Karamcheti, S.; Rush, A.; Kiela, D.; et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems 2024, 36.
- Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023; pp. 16889–16900.
- Renggli, C.; Pinto, A.S.; Houlsby, N.; Mustafa, B.; Puigcerver, J.; Riquelme, C. Learning to merge tokens in vision transformers. arXiv preprint 2022, arXiv:2202.12015.
- Valipour, M.; Rezagholizadeh, M.; Kobyzev, I.; Ghodsi, A. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint 2022, arXiv:2210.07558.
- Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PALM-E: An Embodied Multimodal Language Model. arXiv preprint 2023, arXiv:2303.03378.
- Lin, J.; Yin, H.; Ping, W.; Lu, Y.; Molchanov, P.; Tao, A.; Mao, H.; Kautz, J.; Shoeybi, M.; Han, S. Vila: On pre-training for visual language models. arXiv preprint 2023, arXiv:2312.07533.
- Li, K.; He, Y.; Wang, Y.; Li, Y.; Wang, W.; Luo, P.; Wang, Y.; Wang, L.; Qiao, Y. Videochat: Chat-centric video understanding. arXiv preprint 2023, arXiv:2305.06355.
- Qiao, Y.; Yu, Z.; Guo, L.; Chen, S.; Zhao, Z.; Sun, M.; Wu, Q.; Liu, J. VL-Mamba: Exploring State Space Models for Multimodal Learning. arXiv preprint 2024, arXiv:2403.13600.
- Li, Y.; Wang, C.; Jia, J. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. arXiv preprint 2023, arXiv:2311.17043.
- He, Y.; Liu, L.; Liu, J.; Wu, W.; Zhou, H.; Zhuang, B. PTQD: accurate post-training quantization for diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems; 2023; pp. 13237–13249.
- Yu, C.; Chen, T.; Gan, Z.; Fan, J. Boost vision transformer with gpu-friendly sparsity and quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 22658–22668.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022; pp. 11976–11986.
- Xu, H.; Ye, Q.; Yan, M.; Shi, Y.; Ye, J.; Xu, Y.; Li, C.; Bi, B.; Qian, Q.; Wang, W.; et al. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video. arXiv preprint 2023, arXiv:2302.00402.
- Xiao, Z.; Kreis, K.; Vahdat, A. Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. In Proceedings of the International Conference on Learning Representations; 2022.
- Le, P.H.C.; Li, X. BinaryViT: pushing binary vision transformers towards convolutional models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 4664–4673.
- Papa, L.; Russo, P.; Amerini, I.; Zhou, L. A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024.
- Sauer, A.; Lorenz, D.; Blattmann, A.; Rombach, R. Adversarial diffusion distillation. arXiv preprint 2023, arXiv:2311.17042.
- DeepSeek-AI. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint 2024, arXiv:2401.02954.
- Kar, O.F.; Tonioni, A.; Poklukar, P.; Kulshrestha, A.; Zamir, A.; Tombari, F. BRAVE: Broadening the visual encoding of vision-language models. arXiv preprint 2024, arXiv:2404.07204.
- Sidorov, O.; Hu, R.; Rohrbach, M.; Singh, A. Textcaps: a dataset for image captioning with reading comprehension. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer, 2020; pp. 742–758.
- Mathew, M.; Karatzas, D.; Jawahar, C. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2021; pp. 2200–2209.
- Zhou, B.; Hu, Y.; Weng, X.; Jia, J.; Luo, J.; Liu, X.; Wu, J.; Huang, L. TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv preprint 2024, arXiv:2402.14289.
- Luo, S.; Tan, Y.; Patil, S.; Gu, D.; von Platen, P.; Passos, A.; Huang, L.; Li, J.; Zhao, H. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint 2023, arXiv:2311.05556.
- Wu, Q.; Ye, W.; Zhou, Y.; Sun, X.; Ji, R. Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models. arXiv preprint 2024, arXiv:2403.15226.
- Chu, X.; Qiao, L.; Zhang, X.; Xu, S.; Wei, F.; Yang, Y.; Sun, X.; Hu, Y.; Lin, X.; Zhang, B.; et al. MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv preprint 2024, arXiv:2402.03766.
- Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; Teh, Y.W. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the International conference on machine learning; PMLR, 2019; pp. 3744–3753.
- Malladi, S.; Gao, T.; Nichani, E.; Damian, A.; Lee, J.D.; Chen, D.; Arora, S. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems 2023, 36, 53038–53075.
- Shi, B.; Wu, Z.; Mao, M.; Wang, X.; Darrell, T. When Do We Not Need Larger Vision Models? arXiv preprint 2024, arXiv:2403.13043.
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024.
- Chen, C.; Borgeaud, S.; Irving, G.; Lespiau, J.B.; Sifre, L.; Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint 2023, arXiv:2302.01318.
- Guo, Y.; Yang, C.; Rao, A.; Liang, Z.; Wang, Y.; Qiao, Y.; Agrawala, M.; Lin, D.; Dai, B. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. In Proceedings of the International Conference on Learning Representations; 2024.
- Yue, X.; Ni, Y.; Zhang, K.; Zheng, T.; Liu, R.; Zhang, G.; Stevens, S.; Jiang, D.; Ren, W.; Sun, Y.; et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint 2023, arXiv:2311.16502.
- Shao, Z.; Ouyang, X.; Gai, Z.; Yu, Z.; Yu, J. Imp: An empirical study of multimodal small language models. 2024.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations; 2021.
- Ordonez, V.; Kulkarni, G.; Berg, T. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems 2011, 24.
- Liu, L.; Ren, Y.; Lin, Z.; Zhao, Z. Pseudo Numerical Methods for Diffusion Models on Manifolds. In Proceedings of the International Conference on Learning Representations; 2022.
- Watson, D.; Chan, W.; Ho, J.; Norouzi, M. Learning fast samplers for diffusion models by differentiating through sample quality. In Proceedings of the International Conference on Learning Representations; 2022.
- Xue, S.; Liu, Z.; Chen, F.; Zhang, S.; Hu, T.; Xie, E.; Li, Z. Accelerating Diffusion Sampling with Optimized Time Steps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024; pp. 8292–8301.
- Wang, A.; Chen, H.; Lin, Z.; Zhao, S.; Han, J.; Ding, G. CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs. arXiv preprint 2023, arXiv:2309.15755.
- Sun, Q.; Cui, Y.; Zhang, X.; Zhang, F.; Yu, Q.; Luo, Z.; Wang, Y.; Rao, Y.; Liu, J.; Huang, T.; et al. Generative multimodal models are in-context learners. arXiv preprint 2023, arXiv:2312.13286.
- Zhu, Y.; Zhu, M.; Liu, N.; Ou, Z.; Mou, X.; Tang, J. LLaVA-phi: Efficient Multi-Modal Assistant with Small Language Model. arXiv preprint 2024, arXiv:2401.02330.
- He, M.; Liu, Y.; Wu, B.; Yuan, J.; Wang, Y.; Huang, T.; Zhao, B. Efficient Multimodal Learning from Data-centric Perspective. arXiv preprint 2024, arXiv:2402.11530.
- He, Y.; Lou, Z.; Zhang, L.; Liu, J.; Wu, W.; Zhou, H.; Zhuang, B. BiViT: Extremely Compressed Binary Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023; pp. 5651–5663.
- Luhman, E.; Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint 2021, arXiv:2101.02388.
- Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S.J. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision; 2021; pp. 11936–11945.
- Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint 2015, arXiv:1504.00325.
- Liu, Z.; Wang, Y.; Han, K.; Zhang, W.; Ma, S.; Gao, W. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems 2021, 34, 28092–28103.
- Salimans, T.; Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models. In Proceedings of the International Conference on Learning Representations; 2022.
- Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019; pp. 3195–3204.
- Kingma, D.P.; Salimans, T.; Poole, B.; Ho, J. Variational diffusion models. In Proceedings of the 35th International Conference on Neural Information Processing Systems; 2021; pp. 21696–21707.
- Watson, D.; Ho, J.; Norouzi, M.; Chan, W. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint 2021, arXiv:2106.03802.
- Tang, Y.; Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Tao, D. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 12165–12174.
- Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 2024, 36.
- Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A survey on mixture of experts. arXiv preprint 2024, arXiv:2407.06204.
- Li, W.; Wang, X.; Xia, X.; Wu, J.; Xiao, X.; Zheng, M.; Wen, S. Sepvit: Separable vision transformer. arXiv preprint 2022, arXiv:2203.15380.
- Zhao, B.; Wu, B.; Huang, T. Svit: Scaling up visual instruction tuning. arXiv preprint 2023, arXiv:2307.04087.
- Luo, S.; Tan, Y.; Huang, L.; Li, J.; Zhao, H. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint 2023, arXiv:2310.04378.
- Fu, C.; Chen, P.; Shen, Y.; Qin, Y.; Zhang, M.; Lin, X.; Yang, J.; Zheng, X.; Li, K.; Sun, X.; et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint 2024, arXiv:2306.13394.
- Xu, R.; Yao, Y.; Guo, Z.; Cui, J.; Ni, Z.; Ge, C.; Chua, T.S.; Liu, Z.; Sun, M.; Huang, G. LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images. arXiv preprint 2024, arXiv:2403.11703.
- Javaheripi, M.; Bubeck, S.; Abdin, M.; Aneja, J.; Bubeck, S.; Mendes, C.C.T.; Chen, W.; Del Giorno, A.; Eldan, R.; Gopi, S.; et al. Phi-2: The surprising power of small language models. Microsoft Research Blog 2023.
- Zhang, L.; Zhang, L.; Shi, S.; Chu, X.; Li, B. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint 2023, arXiv:2308.03303.
- Yao, Y.; Yu, T.; Wang, C.; Cui, J.; Zhu, H.; Cai, T.; Zhao, W.; Zhang, K.; Hong, Y.; Li, H.; et al. MiniCPM-V 2.0: An Efficient End-side MLLM with Strong OCR and Understanding Capabilities. 2024. Available online: https://github.com/OpenBMB/MiniCPM-V.
- Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations; 2021.
- Chen, X.; Cao, Q.; Zhong, Y.; Zhang, J.; Gao, S.; Tao, D. Dearkd: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 12052–12062.
- Luo, G.; Zhou, Y.; Ren, T.; Chen, S.; Sun, X.; Ji, R. Cheap and quick: Efficient vision-language instruction tuning for large language models. Advances in Neural Information Processing Systems 2024, 36.
- Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, W.X.; Wen, J.R. Evaluating object hallucination in large vision-language models. arXiv preprint 2023, arXiv:2305.10355.
- Xiao, J.; Li, Z.; Yang, L.; Gu, Q. BinaryViT: Towards Efficient and Accurate Binary Vision Transformers. arXiv preprint 2023, arXiv:2305.14730.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations; 2022.
- Zhu, B.; Lin, B.; Ning, M.; Yan, Y.; Cui, J.; Wang, H.; Pang, Y.; Jiang, W.; Zhang, J.; Li, Z.; et al. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint 2023, arXiv:2310.01852.
- Gong, C.; Wang, D. NASViT: Neural architecture search for efficient vision transformers with gradient conflict-aware supernet training. In Proceedings of the International Conference on Learning Representations; 2022.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 2017, 123, 32–73.
- Dockhorn, T.; Vahdat, A.; Kreis, K. GENIE: higher-order denoising diffusion solvers. In Proceedings of the 36th International Conference on Neural Information Processing Systems; 2022; pp. 30150–30166.
- Fang, A.; Jose, A.M.; Jain, A.; Schmidt, L.; Toshev, A.; Shankar, V. Data filtering networks. arXiv preprint 2023, arXiv:2309.17425.
- Gurari, D.; Li, Q.; Stangl, A.J.; Guo, A.; Lin, C.; Grauman, K.; Luo, J.; Bigham, J.P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition; 2018; pp. 3608–3617.
- Zheng, H.; Nie, W.; Vahdat, A.; Azizzadenesheli, K.; Anandkumar, A. Fast sampling of diffusion models via operator learning. In Proceedings of the International conference on machine learning; 2023; pp. 42390–42402.
- Cha, J.; Kang, W.; Mun, J.; Roh, B. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint 2023, arXiv:2312.06742.
- Wan, Z.; Wang, X.; Liu, C.; Alam, S.; Zheng, Y.; Liu, J.; Qu, Z.; Yan, S.; Zhu, Y.; Zhang, Q.; et al. Efficient Large Language Models: A Survey. arXiv preprint 2024, arXiv:2312.03863.
- Mescheder, L.; Geiger, A.; Nowozin, S. Which training methods for GANs do actually converge? In Proceedings of the International conference on machine learning; 2018; pp. 3481–3490.
- Lin, Z.; Lin, M.; Lin, L.; Ji, R. Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference. arXiv preprint 2024, arXiv:2405.05803.
- Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S. COYO-700M: Image-Text Pair Dataset. 2022. Available online: https://github.com/kakaobrain/coyo-dataset.
- Zhang, H.; Li, X.; Bing, L. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint 2023, arXiv:2306.02858.
- Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv preprint 2023, abs/2306.
- Chen, K.; Zhang, Z.; Zeng, W.; Zhang, R.; Zhu, F.; Zhao, R. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint 2023, arXiv:2306.15195.
- Zhou, Q.; Sheng, K.; Zheng, X.; Li, K.; Sun, X.; Tian, Y.; Chen, J.; Ji, R. Training-free transformer architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 10894–10903.
- Chen, J.; Liu, Y.; Li, D.; An, X.; Feng, Z.; Zhao, Y.; Xie, Y. Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models. arXiv preprint 2024, arXiv:2403.19322.
- Gagrani, M.; Goel, R.; Jeon, W.; Park, J.; Lee, M.; Lott, C. On Speculative Decoding for Multimodal Large Language Models. arXiv preprint 2024, arXiv:2404.08856.
- Ainslie, J.; Lee-Thorp, J.; de Jong, M.; Zemlyanskiy, Y.; Lebrón, F.; Sanghai, S. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint 2023, arXiv:2305.13245.
- Liang, Y.; Ge, C.; Tong, Z.; Song, Y.; Wang, J.; Xie, P. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint 2022, arXiv:2202.07800.
- Dong, P.; Lu, L.; Wu, C.; Lyu, C.; Yuan, G.; Tang, H.; Wang, Y. PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile. Advances in Neural Information Processing Systems 2024, 36.
- Huang, L.; Wu, S.; Cui, Y.; Xiong, Y.; Liu, X.; Kuo, T.W.; Guan, N.; Xue, C.J. RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference. arXiv preprint 2024, arXiv:2405.15198.
- Chung, H.; Sim, B.; Ye, J.C. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 12413–12422.
- Yu, Y.Q.; Liao, M.; Wu, J.; Liao, Y.; Zheng, X.; Zeng, W. TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models. arXiv preprint 2024, arXiv:2404.09204.
- Ding, Y.; Qin, H.; Yan, Q.; Chai, Z.; Liu, J.; Wei, X.; Liu, X. Towards accurate post-training quantization for vision transformer. In Proceedings of the 30th ACM International Conference on Multimedia; 2022; pp. 5380–5388.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint 2023, arXiv:2312.00752.
- He, B.; Li, H.; Jang, Y.K.; Jia, M.; Cao, X.; Shah, A.; Shrivastava, A.; Lim, S.N. MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding. arXiv preprint 2024, arXiv:2404.05726.
Summary of efficiency improvements across the main approaches (Section 3.5):
| Method | Speedup | Memory Reduction | Quality Trade-off |
|---|---|---|---|
| DDIMs | High | Low | Minimal |
| Latent Diffusion Models | High | High | Minimal |
| Transformer-Based Models | Medium | Medium | Variable |
| Knowledge Distillation | High | High | Some |
| Hybrid Generative Models | Medium | Medium | Variable |
Open challenges and potential research directions (Section 4.10):
| Challenge | Potential Research Directions |
|---|---|
| Sampling Speed vs. Quality [110] | Adaptive solvers, learned step sizes, hybrid samplers |
| Training Efficiency | Gradient checkpointing, mixed-precision, distributed training |
| Model Architecture | NAS, sparse networks, efficient transformers |
| Conditional Diffusion | Optimized attention, sparse conditioning, multi-modal inputs |
| Distillation and Learning | Self-distillation, progressive training, online adaptation |
| Hybrid Generative Models | GAN-diffusion hybrids, energy-based models, latent-space diffusion |
| Hardware and Deployment | AI accelerators, mobile optimization, inference frameworks |
| Theoretical Foundations | Optimal sampling theory, diffusion-GAN connections |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
