Submitted: 31 January 2025
Posted: 03 February 2025
Abstract
Diffusion models have emerged as a powerful class of generative models, offering state-of-the-art performance across various domains such as image synthesis, audio generation, and molecular design. Their unique approach, which involves modeling data distributions through iterative noise addition and denoising processes, has established them as a robust alternative to traditional generative frameworks like GANs and VAEs. However, the scalability of diffusion models—essential for handling high-dimensional data, large-scale datasets, and complex multimodal tasks—poses significant challenges. This survey provides a comprehensive overview of scalable diffusion models, focusing on the innovations that enable their efficient training and sampling. We explore advancements in noise schedules, neural architectures, and sampling acceleration techniques, alongside strategies for training on large-scale datasets and deploying models in resource-constrained environments. Furthermore, we highlight the transformative applications of scalable diffusion models across fields such as creative content generation, healthcare, scientific research, and more. Despite their successes, diffusion models face critical challenges, including computational inefficiency, resource-intensive training, and ethical concerns related to bias and misuse. We discuss these open challenges and outline promising directions for future research, emphasizing the need for interdisciplinary collaboration and task-specific adaptations. By addressing these challenges, scalable diffusion models have the potential to redefine the boundaries of generative modeling, driving innovation and enabling new applications in science, technology, and creative industries. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the field of diffusion models.
1. Introduction
2. Background and Fundamentals
2.1. Diffusion Processes and Stochastic Modeling
2.2. Training Objective
2.3. Relationship to Other Generative Models
2.4. Sampling Process
2.5. Scalability Challenges
3. Scalability Techniques for Diffusion Models
3.1. Efficient Noise Schedules
3.2. Improved Neural Architectures
- Cross-Attention Mechanisms: Incorporating cross-attention layers enables diffusion models to handle multimodal inputs, such as text-to-image tasks, by effectively fusing information from different modalities [47] (see the sketch after this list).
- Hierarchical Models: Leveraging hierarchical structures allows models to process data at multiple resolutions, reducing computational overhead while maintaining high-quality outputs.
- Lightweight Architectures: Designing lightweight networks with fewer parameters reduces memory requirements and training time, making diffusion models more accessible [48].
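As a concrete illustration of the cross-attention point above, the following is a minimal PyTorch sketch of a block in which flattened image features attend to text-encoder tokens. The dimensions, the residual wiring, and the use of `nn.MultiheadAttention` are illustrative choices, not a specific published architecture.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal cross-attention block: image features attend to text tokens."""
    def __init__(self, dim: int, context_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          kdim=context_dim, vdim=context_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, num_pixels, dim)        -- flattened image features
        # context: (batch, num_tokens, context_dim) -- e.g. text-encoder output
        attended, _ = self.attn(query=self.norm(x), key=context, value=context)
        return x + attended  # residual connection

x = torch.randn(2, 64 * 64, 320)   # a feature map flattened to tokens
context = torch.randn(2, 77, 768)  # e.g. CLIP-style text embeddings
out = CrossAttention(dim=320, context_dim=768)(x, context)
```

In text-to-image U-Nets this kind of block is typically interleaved with self-attention and convolutional layers at several resolutions.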
3.3. Accelerated Sampling Techniques
- Denoising Diffusion Implicit Models (DDIM): DDIM introduces a deterministic sampling process that reduces the number of timesteps required, enabling faster generation [49] (a single update step is sketched after this list).
- Dynamic Programming Methods: These methods optimize the reverse process by adaptively selecting timesteps, focusing computational resources where they are most needed [50].
- Score-Based Methods: Score-based generative models approximate the score of the data distribution, i.e., the gradient of its log-density, allowing for more efficient sampling through improved step sizes and noise schedules.
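To make the DDIM bullet concrete, here is a minimal sketch of one deterministic DDIM update (the eta = 0 case of [49]); `model` and `alpha_bar` are assumed to come from a standard DDPM setup.

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha_bar):
    """One deterministic DDIM update (eta = 0).

    model     -- noise predictor eps_theta(x_t, t)
    alpha_bar -- 1-D tensor of cumulative products of (1 - beta_t)
    """
    eps = model(x_t, t)
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    # Clean-sample estimate implied by the current noise prediction.
    x0_pred = (x_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
    # Deterministic move toward x0 along the implicit trajectory.
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps
```

Because consecutive calls need not use adjacent timesteps, a model trained on a 1000-step schedule can be sampled with, say, 50 coarsely spaced (t, t_prev) pairs.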
3.4. Training on Large-Scale Datasets
- Distributed Training: Leveraging distributed computing frameworks enables efficient training across multiple GPUs or TPUs, significantly reducing training time (see the data-parallel sketch after this list).
- Curriculum Learning: Gradually increasing the complexity of training data helps models converge faster and achieve better generalization [53].
- Synthetic Data Augmentation: Augmenting datasets with synthetic samples generated by smaller models can bootstrap the training of larger diffusion models [54].
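As a sketch of the distributed-training point, the following uses PyTorch DistributedDataParallel for data parallelism. The dataset, batch size, and loss computation are placeholders, and the script is assumed to be launched with one process per GPU (e.g., via `torchrun`).

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    dist.init_process_group("nccl")        # one process per GPU
    rank = dist.get_rank()                 # assumes single-node: rank == device
    model = DDP(model.cuda(rank), device_ids=[rank])
    sampler = DistributedSampler(dataset)  # shards the dataset across processes
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)           # reshuffle shards each epoch
        for batch in loader:
            loss = model(batch.cuda(rank)).mean()  # placeholder diffusion loss
            opt.zero_grad()
            loss.backward()                # gradients are all-reduced across GPUs
            opt.step()
```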
3.5. Compression Techniques for Diffusion Models
3.5.1. Model Pruning
- Magnitude-Based Pruning: Parameters with magnitudes below a certain threshold are removed, simplifying the model without significant loss in performance (a minimal example follows this list).
- Iterative Pruning and Fine-Tuning: Pruning is performed iteratively, followed by fine-tuning to recover any lost performance.
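A minimal example of magnitude-based pruning using PyTorch's built-in pruning utilities follows; the 30% ratio is arbitrary, and in the iterative scheme this call alternates with fine-tuning rounds.

```python
import torch
import torch.nn.utils.prune as prune

def magnitude_prune(model: torch.nn.Module, amount: float = 0.3):
    """Zero out the `amount` fraction of smallest-magnitude weights per layer."""
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent
    return model
```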
3.5.2. Quantization
- Post-Training Quantization: Quantization is applied after training without additional modifications to the training process (see the sketch after this list).
- Quantization-Aware Training: The model is trained with quantization in mind, improving its robustness to reduced precision.
- Mixed-Precision Techniques: Different parts of the model are quantized to varying levels of precision based on their sensitivity to quantization errors.
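The sketch below applies post-training dynamic quantization to the linear layers of a toy network. This is the simplest variant: full diffusion backbones usually require calibration data or quantization-aware training, as in the post-training quantization works cited in this survey.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.SiLU(), nn.Linear(512, 512))

# Post-training dynamic quantization: weights are stored as int8 and
# activations are quantized on the fly; no retraining is required.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```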
3.5.3. Knowledge Distillation
- Distillation Objectives: Designing loss functions that align the student’s outputs with the teacher’s, including logits, intermediate feature maps, or denoising trajectories (an illustrative objective is sketched after this list).
- Task-Specific Distillation: Tailoring the distillation process to specific tasks, such as image synthesis or text-to-image generation.
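As an illustrative (not canonical) distillation objective, the sketch below matches a student's noise prediction to a frozen teacher's at the same noisy input and timestep; real recipes such as progressive distillation additionally shorten the student's sampling trajectory.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, x_t, t):
    """One distillation step: align the student's denoising output with the teacher's."""
    with torch.no_grad():
        eps_teacher = teacher(x_t, t)   # frozen teacher prediction
    eps_student = student(x_t, t)
    return F.mse_loss(eps_student, eps_teacher)
```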
3.5.4. Parameter Sharing and Factorization
- Weight Sharing: Sharing weights across different layers or timesteps to reduce model size.
- Low-Rank Factorization: Decomposing weight matrices into low-rank approximations, reducing the number of parameters while preserving representational capacity (see the SVD-based sketch after this list).
- Tensor Decomposition: Applying techniques like singular value decomposition (SVD) or Tucker decomposition to compress large parameter tensors.
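A minimal SVD-based factorization of a single linear layer is sketched below; choosing the rank, and deciding which layers to factorize, is the task-specific part. The factorization reduces the parameter count from `out x in` to `rank x (in + out)`.

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace an (out x in) weight with two rank-r factors via truncated SVD."""
    W = linear.weight.data                      # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                # fold singular values into U
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = Vh[:rank, :]            # (rank, in_features)
    second.weight.data = U_r                    # (out_features, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return nn.Sequential(first, second)
```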
3.5.5. Sparse Representations
- Sparse Training: Enforcing sparsity constraints during training to produce inherently sparse models.
- Post-Training Sparsification: Applying sparsity-inducing regularizers or thresholding methods to trained models.
- Dynamic Sparsity: Adjusting sparsity patterns dynamically during training or inference to optimize performance [60].
3.5.6. Hybrid Compression Techniques
- Pruning and Quantization: Applying pruning to reduce model size, followed by quantization to accelerate inference (combined in the short sketch after this list).
- Distillation and Low-Rank Factorization: Using knowledge distillation to train a smaller model and factorizing its parameters for additional compression.
- Sparse Quantization: Combining sparsity with quantization to achieve both storage and computational efficiency.
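Assuming the `magnitude_prune` helper sketched in Section 3.5.1 and an assumed pre-trained `model`, a hybrid pipeline can be as short as pruning followed by dynamic quantization:

```python
import torch
import torch.nn as nn

# Hybrid compression: prune first (magnitude_prune from Section 3.5.1),
# then quantize the surviving linear-layer weights to int8.
compressed = torch.ao.quantization.quantize_dynamic(
    magnitude_prune(model, amount=0.5),
    {nn.Linear}, dtype=torch.qint8,
)
```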
3.5.7. Challenges and Future Directions
- Maintaining Quality: Ensuring that compression does not degrade the quality of generated outputs, especially for high-resolution or multimodal tasks.
- Task-Specific Tuning: Adapting compression techniques to the unique requirements of diffusion models, such as iterative denoising and time-step-dependent operations.
- Scalability: Extending compression methods to accommodate larger models and datasets without excessive computational overhead.
3.6. Hybrid and Modular Approaches
3.7. Hardware and Optimization Advances
- Mixed-Precision Training: Utilizing lower-precision formats, such as FP16, reduces memory usage and accelerates training without significant loss in accuracy [63] (see the sketch after this list).
- Custom Hardware Accelerators: Dedicated accelerators, such as GPUs and TPUs optimized for deep learning workloads, have significantly reduced the computational burden of training and sampling.
- Gradient Accumulation and Checkpointing: These techniques optimize memory usage during training, enabling the handling of larger batch sizes and models.
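The first and third points combine naturally in a single loop. The sketch below uses PyTorch automatic mixed precision with gradient accumulation; `model`, `optimizer`, and `loader` are assumed to exist, and activation checkpointing via `torch.utils.checkpoint` would slot into the model's forward pass.

```python
import torch

scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                            # effective batch = loader batch * 4

for step, batch in enumerate(loader):      # loader/model/optimizer assumed given
    with torch.cuda.amp.autocast():        # FP16 where numerically safe
        loss = model(batch.cuda()) / accum_steps
    scaler.scale(loss).backward()          # loss scaling avoids FP16 underflow
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)             # unscales, then applies the update
        scaler.update()
        optimizer.zero_grad()
```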
3.8. Scalability in Multimodal and High-Resolution Applications
- Multimodal Pretraining: Training models on diverse datasets spanning multiple modalities improves their ability to generalize across tasks [65].
- Progressive Resolution Techniques: Generating data at progressively higher resolutions reduces the computational cost of high-resolution synthesis [66].
- Guided Diffusion: Techniques like classifier-free guidance improve the control and quality of generated outputs, especially in multimodal settings [67]; the guidance rule is sketched below.
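Classifier-free guidance admits a one-line update rule: the final noise estimate extrapolates from the unconditional prediction toward the conditional one. A minimal sketch, assuming a `model` trained with random condition dropout:

```python
import torch

@torch.no_grad()
def cfg_eps(model, x_t, t, cond, guidance_scale: float = 7.5):
    """Classifier-free guidance [67]: extrapolate from the unconditional
    prediction toward the conditional one."""
    eps_uncond = model(x_t, t, cond=None)   # condition dropped at train time too
    eps_cond = model(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger guidance scales trade sample diversity for fidelity to the condition.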
4. Applications of Scalable Diffusion Models
4.1. Image Synthesis and Editing
- High-Resolution Image Generation: Models like DALL·E 2 and Stable Diffusion generate detailed images at resolutions up to 4K, enabling applications in digital art, design, and entertainment [73].
- Image Inpainting and Editing: Diffusion models can seamlessly fill in missing parts of an image or edit existing images with user-specified modifications, making them valuable tools for content creation [74].
- Style Transfer and Customization: By conditioning on specific style or content inputs, diffusion models can generate images tailored to user preferences [75].
4.2. Text-to-Image Synthesis
- Creative Content Generation: Artists and designers use text-to-image models to create illustrations, concept art, and visual storytelling elements [77].
- Advertising and Marketing: Businesses employ these models to generate customized visuals for advertisements and promotional materials based on specific themes or messages [78].
- Accessibility and Education: Text-to-image models enhance accessibility by generating visual aids for educational content or assisting visually impaired individuals in understanding textual information.
4.3. Audio and Speech Generation
- Speech Synthesis: Diffusion models generate natural-sounding speech with high fidelity, finding use in virtual assistants, dubbing, and accessibility tools [79].
- Music Generation: These models create original compositions or remix existing tracks, aiding musicians and content creators in their workflows [80].
- Sound Effects Design: Generating realistic sound effects for films, games, and virtual environments is another emerging application of diffusion-based audio models [81].
4.4. Molecular and Drug Design
- Drug Discovery: Diffusion models assist in identifying potential drug candidates by exploring vast chemical spaces efficiently [82].
- Protein Design: These models generate protein structures optimized for specific functions, accelerating advancements in biotechnology and medicine.
- Material Science: Generating new materials with tailored properties is another area where diffusion models are being actively explored [83].
4.5. Creative Applications
- Digital Art and Animation: Artists use diffusion models to create unique artworks and animations, expanding the possibilities of creative expression.
- Game Design: These models generate assets, such as characters, environments, and textures, streamlining the game development process [84].
- Film and Media Production: Diffusion models aid in visual effects creation, storyboarding, and content generation for films and media projects.
4.6. Healthcare and Medical Imaging
- Medical Image Reconstruction: Diffusion models improve the quality of medical images, such as MRI and CT scans, by denoising and enhancing resolution [85].
- Anomaly Detection: These models assist in identifying anomalies in medical images, aiding in early diagnosis and treatment planning.
- Synthetic Data Generation: Generating synthetic medical data helps address data scarcity while preserving patient privacy [86].
4.7. Scientific Research and Simulation
- Climate Modeling: Generating high-resolution climate simulations to predict weather patterns and study environmental changes [87].
- Physics Simulations: Modeling complex physical systems, such as fluid dynamics and particle interactions, with high accuracy [88].
- Astronomy and Space Exploration: Enhancing astronomical images and generating realistic simulations of celestial phenomena [89].
4.8. Open Challenges in Applications
5. Open Challenges and Future Directions
5.1. Sampling Efficiency
- Reduced-Step Sampling: Developing techniques that minimize the number of timesteps required for sampling without compromising output quality, such as improved noise schedules and learned sampling strategies [97].
- Parallel Sampling: Exploring methods to parallelize the sampling process, leveraging advancements in hardware accelerators and distributed computing.
- Hybrid Approaches: Combining diffusion models with other generative frameworks, such as GANs, to leverage the fast sampling capabilities of alternative methods [98].
5.2. Training Efficiency and Resource Requirements
- Data-Efficient Training: Developing training paradigms that require fewer data samples, such as self-supervised learning and transfer learning.
- Efficient Architectures: Designing lightweight and modular architectures that reduce memory and computational overhead while maintaining performance [100].
- Energy Efficiency: Investigating energy-efficient training methods to reduce the environmental impact of large-scale diffusion models [101].
5.3. Scalability to High-Resolution and Multimodal Tasks
- Progressive Resolution Techniques: Implementing hierarchical or progressive generation strategies to reduce computational costs for high-resolution tasks [102].
- Unified Multimodal Models: Developing models that can seamlessly integrate and process multiple data modalities, leveraging shared representations and cross-modal attention mechanisms [103].
- Task-Specific Adaptations: Tailoring diffusion models to specific tasks, optimizing their performance for targeted applications.
5.4. Robustness and Generalization
- Adversarial Robustness: Enhancing the resilience of diffusion models to adversarial attacks through robust training techniques [104].
- Domain Adaptation: Improving the ability of models to generalize across domains with limited or no fine-tuning.
- Uncertainty Quantification: Incorporating mechanisms to quantify and manage uncertainty in generated outputs, particularly in high-stakes applications [105].
5.5. Ethical and Societal Considerations
- Bias Mitigation: Identifying and mitigating biases in training datasets and model outputs to promote fairness and inclusivity.
- Content Moderation: Implementing safeguards to prevent the generation of harmful or malicious content, such as misinformation or explicit imagery.
- Transparency and Accountability: Enhancing the interpretability of diffusion models and establishing clear accountability frameworks for their use [108].
5.6. Theoretical Understanding and Interpretability
- Optimization Dynamics: Analyzing the training and sampling dynamics of diffusion models to identify areas for improvement [111].
- Connections to Other Frameworks: Exploring the relationships between diffusion models and other generative approaches, such as energy-based models and normalizing flows.
- Interpretability Techniques: Developing tools and methods to interpret the decisions and outputs of diffusion models, particularly in critical applications [112].
5.7. Expanding Applications
- Healthcare: Advancing applications in medical imaging, drug discovery, and personalized medicine [115].
- Education: Enhancing educational tools through interactive content generation and multimodal learning aids [116].
- Environmental Science: Supporting climate modeling, ecological simulations, and sustainable development initiatives [117].
5.8. Future Outlook
6. Conclusion
References
- Han, Y.; Zhang, C.; Chen, X.; Yang, X.; Wang, Z.; Yu, G.; Fu, B.; Zhang, H. ChartLlama: A Multimodal LLM for Chart Understanding and Generation, 2023, [arXiv:cs.CV/2311.16483].
- Yu, L.; Xiang, W. X-pruner: explainable pruning for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24355–24363.
- Lo, K.M.; Liang, Y.; Du, W.; Fan, Y.; Wang, Z.; Huang, W.; Ma, L.; Fu, J. m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers. arXiv preprint arXiv:2402.16918 2024.
- Zhao, W.; Han, Y.; Tang, J.; Wang, K.; Song, Y.; Huang, G.; Wang, F.; You, Y. Dynamic Diffusion Transformer. arXiv preprint arXiv:2410.03456 2024.
- Liu, Y.; Yang, H.; Dong, Z.; Keutzer, K.; Du, L.; Zhang, S. Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20321–20330.
- Chu, X.; Qiao, L.; Lin, X.; Xu, S.; Yang, Y.; Hu, Y.; Wei, F.; Zhang, X.; Zhang, B.; Wei, X.; et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886 2023.
- Cha, J.; Kang, W.; Mun, J.; Roh, B. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742 2023.
- Liang, Y.; Ge, C.; Tong, Z.; Song, Y.; Wang, J.; Xie, P. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800 2022.
- Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326.
- Marin, D.; Chang, J.H.R.; Ranjan, A.; Prabhu, A.; Rastegari, M.; Tuzel, O. Token pooling in vision transformers. arXiv preprint arXiv:2110.03860 2021.
- Gao, P.; Zhang, R.; Liu, C.; Qiu, L.; Huang, S.; Lin, W.; Zhao, S.; Geng, S.; Lin, Z.; Jin, P.; et al. SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv preprint arXiv:2402.05935 2024.
- Azizi, S.; Mustafa, B.; Ryan, F.; Beaver, Z.; Freyberg, J.; Deaton, J.; Loh, A.; Karthikesalingam, A.; Kornblith, S.; Chen, T.; et al. Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3478–3488.
- Zhai, X.; Mustafa, B.; Kolesnikov, A.; Beyer, L. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11975–11986.
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International conference on machine learning. PMLR, 2023, pp. 19730–19742.
- Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, 2022.
- Sun, Q.; Cui, Y.; Zhang, X.; Zhang, F.; Yu, Q.; Luo, Z.; Wang, Y.; Rao, Y.; Liu, J.; Huang, T.; et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286 2023.
- Sauer, A.; Boesel, F.; Dockhorn, T.; Blattmann, A.; Esser, P.; Rombach, R. Fast high-resolution image synthesis with latent adversarial diffusion distillation. arXiv preprint arXiv:2403.12015 2024.
- Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing Systems 2022, 35, 12934–12949.
- Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 11918–11930.
- Watson, D.; Chan, W.; Ho, J.; Norouzi, M. Learning fast samplers for diffusion models by differentiating through sample quality. In Proceedings of the International Conference on Learning Representations, 2022.
- Chavan, A.; Shen, Z.; Liu, Z.; Liu, Z.; Cheng, K.T.; Xing, E.P. Vision transformer slimming: Multi-dimension searching in continuous optimization space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4931–4941.
- Zhou, B.; Hu, Y.; Weng, X.; Jia, J.; Luo, J.; Liu, X.; Wu, J.; Huang, L. TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv preprint arXiv:2402.14289 2024.
- Liu, L.; Ren, Y.; Lin, Z.; Zhao, Z. Pseudo Numerical Methods for Diffusion Models on Manifolds. In Proceedings of the International Conference on Learning Representations, 2022.
- Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; Mottaghi, R. A-okvqa: A benchmark for visual question answering using world knowledge. In Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 146–162.
- Luhman, E.; Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 2021.
- Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- Chung, H.; Sim, B.; Ye, J.C. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12413–12422.
- Zhu, Y.; Zhu, M.; Liu, N.; Ou, Z.; Mou, X.; Tang, J. LLaVA-phi: Efficient Multi-Modal Assistant with Small Language Model. arXiv preprint arXiv:2401.02330 2024.
- Zhang, Q.; Chen, Y. Fast Sampling of Diffusion Models with Exponential Integrator. In Proceedings of the International Conference on Learning Representations, 2023.
- Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 2024, 36.
- Liu, X.; Gong, C.; et al. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In Proceedings of the Eleventh International Conference on Learning Representations, 2022.
- Yin, T.; Gharbi, M.; Zhang, R.; Shechtman, E.; Durand, F.; Freeman, W.T.; Park, T. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6613–6623.
- Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 2020.
- Pan, Z.; Zhuang, B.; Huang, D.A.; Nie, W.; Yu, Z.; Xiao, C.; Cai, J.; Anandkumar, A. T-stitch: Accelerating sampling in pre-trained diffusion models with trajectory stitching. arXiv preprint arXiv:2402.14167 2024.
- Zhao, Y.; Xu, Y.; Xiao, Z.; Jia, H.; Hou, T. MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices, 2024, [arXiv:cs.CV/2311.16567].
- Tu, T.; Azizi, S.; Driess, D.; Schaekermann, M.; Amin, M.; Chang, P.C.; Carroll, A.; Lau, C.; Tanno, R.; Ktena, I.; et al. Towards generalist biomedical ai. NEJM AI 2024, 1, AIoa2300138. [CrossRef]
- Yu, S.; Chen, T.; Shen, J.; Yuan, H.; Tan, J.; Yang, S.; Liu, J.; Wang, Z. Unified visual transformer compression. arXiv preprint arXiv:2203.08243 2022.
- Liu, Z.; Wang, Y.; Han, K.; Zhang, W.; Ma, S.; Gao, W. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems 2021, 34, 28092–28103.
- Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022, pp. 5775–5787.
- Chen, Z.; Ma, X.; Fang, G.; Tan, Z.; Wang, X. AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising, 2024, [arXiv:cs.CV/2406.06911].
- Valipour, M.; Rezagholizadeh, M.; Kobyzev, I.; Ghodsi, A. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558 2022.
- Shi, B.; Wu, Z.; Mao, M.; Wang, X.; Darrell, T. When Do We Not Need Larger Vision Models? arXiv preprint arXiv:2403.13043 2024.
- Chen, J.; Yu, Q.; Shen, X.; Yuille, A.; Chen, L.C. ViTamin: Designing Scalable Vision Models in the Vision-Language Era, 2024, [arXiv:cs.CV/2404.02132].
- Dong, X.; Zhang, P.; Zang, Y.; Cao, Y.; Wang, B.; Ouyang, L.; Zhang, S.; Duan, H.; Zhang, W.; Li, Y.; et al. InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD. arXiv preprint arXiv:2404.06512 2024.
- Zhu, W.; Hessel, J.; Awadalla, A.; Gadre, S.Y.; Dodge, J.; Fang, A.; Yu, Y.; Schmidt, L.; Wang, W.Y.; Choi, Y. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems 2024, 36.
- Wu, Q.; Liu, Y.; Zhao, H.; Kale, A.; Bui, T.; Yu, T.; Lin, Z.; Zhang, Y.; Chang, S. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1900–1910.
- Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the International Conference on Machine Learning, 2024.
- Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 2015.
- Luo, S.; Tan, Y.; Patil, S.; Gu, D.; von Platen, P.; Passos, A.; Huang, L.; Li, J.; Zhao, H. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556 2023.
- Chen, T.; Cheng, Y.; Gan, Z.; Yuan, L.; Zhang, L.; Wang, Z. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems 2021, 34, 19974–19988.
- Yuan, Z.; Xue, C.; Chen, Y.; Wu, Q.; Sun, G. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. In Proceedings of the European conference on computer vision. Springer, 2022, pp. 191–207.
- Chen, M.; Peng, H.; Fu, J.; Ling, H. Autoformer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12270–12280.
- Shang, Y.; Yuan, Z.; Xie, B.; Wu, B.; Yan, Y. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1972–1981.
- Lin, B.; Zhu, B.; Ye, Y.; Ning, M.; Jin, P.; Yuan, L. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv preprint arXiv:2311.10122 2023.
- Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393. [CrossRef]
- Saleh, B.; Elgammal, A. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv preprint arXiv:1505.00855 2015.
- Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Proceedings of the International Conference on Learning Representations, 2019.
- Fan, Y.; Lee, K. Optimizing DDPM Sampling with Shortcut Fine-Tuning. In Proceedings of the International Conference on Machine Learning, 2023, pp. 9623–9639.
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024. [CrossRef]
- Chen, Y.H.; Sarokin, R.; Lee, J.; Tang, J.; Chang, C.L.; Kulik, A.; Grundmann, M. Speed is all you need: On-device acceleration of large diffusion models via gpu-aware optimizations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4651–4655.
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3558–3568.
- Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; Jiang, D. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 2023.
- Ye, Q.; Xu, H.; Ye, J.; Yan, M.; Hu, A.; Liu, H.; Qian, Q.; Zhang, J.; Huang, F.; Zhou, J. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration, 2023, [arXiv:cs.CL/2311.04257].
- Shang, Y.; Cai, M.; Xu, B.; Lee, Y.J.; Yan, Y. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, 2024, [arXiv:cs.CV/2403.15388].
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- Wang, J.; Fang, J.; Li, A.; Yang, P. PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models, 2024, [arXiv:cs.CV/2405.14430].
- Wang, G.; Liu, J.; Li, C.; Ma, J.; Zhang, Y.; Wei, X.; Zhang, K.; Chong, M.; Zhang, R.; Liu, Y.; et al. Cloud-Device Collaborative Learning for Multimodal Large Language Models. arXiv preprint arXiv:2312.16279 2023.
- Gupta, A.; Gu, A.; Berant, J. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems 2022, 35, 22982–22994.
- Huang, L.; Wu, S.; Cui, Y.; Xiong, Y.; Liu, X.; Kuo, T.W.; Guan, N.; Xue, C.J. RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference. arXiv preprint arXiv:2405.15198 2024.
- Li, K.; He, Y.; Wang, Y.; Li, Y.; Wang, W.; Luo, P.; Wang, Y.; Wang, L.; Qiao, Y. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 2023.
- Papa, L.; Russo, P.; Amerini, I.; Zhou, L. A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024, pp. 1–20. [CrossRef]
- Guo, Y.; Yang, C.; Rao, A.; Liang, Z.; Wang, Y.; Qiao, Y.; Agrawala, M.; Lin, D.; Dai, B. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. In Proceedings of the International Conference on Learning Representations, 2024. [CrossRef]
- Liu, X.; Zhang, X.; Ma, J.; Peng, J.; et al. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In Proceedings of the Twelfth International Conference on Learning Representations, 2023.
- Laurençon, H.; Saulnier, L.; Tronchon, L.; Bekman, S.; Singh, A.; Lozhkov, A.; Wang, T.; Karamcheti, S.; Rush, A.; Kiela, D.; et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems 2024, 36.
- Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S.J. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11936–11945.
- Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency Models. In Proceedings of the International Conference on Machine Learning, 2023, pp. 32211–32252.
- Zhang, J.; Peng, H.; Wu, K.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Minivit: Compressing vision transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12145–12154.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.
- Zhang, L.; Hu, A.; Xu, H.; Yan, M.; Xu, Y.; Jin, Q.; Zhang, J.; Huang, F. TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning. arXiv preprint arXiv:2404.16635 2024.
- Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 2021, 34, 8780–8794.
- He, Y.; Liu, L.; Liu, J.; Wu, W.; Zhou, H.; Zhuang, B. PTQD: accurate post-training quantization for diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 13237–13249.
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6904–6913.
- Gurari, D.; Li, Q.; Stangl, A.J.; Guo, A.; Lin, C.; Grauman, K.; Luo, J.; Bigham, J.P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3608–3617.
- Lu, H.; Liu, W.; Zhang, B.; Wang, B.; Dong, K.; Liu, B.; Sun, J.; Ren, T.; Li, Z.; Yang, H.; et al. DeepSeek-VL: Towards Real-World Vision-Language Understanding, 2024, [arXiv:cs.AI/2403.05525].
- Liu, Y.; Duan, H.; Zhang, Y.; Li, B.; Zhang, S.; Zhao, W.; Yuan, Y.; Wang, J.; He, C.; Liu, Z.; et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 2023.
- Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3195–3204.
- Xu, R.; Yao, Y.; Guo, Z.; Cui, J.; Ni, Z.; Ge, C.; Chua, T.S.; Liu, Z.; Sun, M.; Huang, G. LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images, 2024, [arXiv:cs.CV/2403.11703].
- Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.L.; Murphy, K. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 11–20.
- Kar, O.F.; Tonioni, A.; Poklukar, P.; Kulshrestha, A.; Zamir, A.; Tombari, F. BRAVE: Broadening the visual encoding of vision-language models. arXiv preprint arXiv:2404.07204 2024.
- Luo, S.; Tan, Y.; Huang, L.; Li, J.; Zhao, H. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 2023.
- Zhang, P.; Zeng, G.; Wang, T.; Lu, W. TinyLlama: An Open-Source Small Language Model, 2024, [arXiv:cs.CL/2401.02385].
- Yuan, Z.; Li, Z.; Sun, L. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862 2023.
- Lin, Z.; Lin, M.; Lin, L.; Ji, R. Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference, 2024, [arXiv:cs.CV/2405.05803].
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Belanger, D.; Colwell, L.; et al. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555 2020.
- Meng, C.; Rombach, R.; Gao, R.; Kingma, D.; Ermon, S.; Ho, J.; Salimans, T. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14297–14306.
- Hudson, D.A.; Manning, C.D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6700–6709.
- Kazemzadeh, S.; Ordonez, V.; Matten, M.; Berg, T. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 787–798.
- Qiao, Y.; Yu, Z.; Guo, L.; Chen, S.; Zhao, Z.; Sun, M.; Wu, Q.; Liu, J. VL-Mamba: Exploring State Space Models for Multimodal Learning. arXiv preprint arXiv:2403.13600 2024.
- Jie, S.; Tang, Y.; Ding, N.; Deng, Z.H.; Han, K.; Wang, Y. Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning, 2024, [arXiv:cs.CV/2405.05615].
- Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; Raffel, C. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. arXiv preprint arXiv:2205.05638 2022.
- Lin, C.; Peng, B.; Li, Z.; Tan, W.; Ren, Y.; Xiao, J.; Pu, S. Bit-shrinking: Limiting instantaneous sharpness for improving post-training quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16196–16205.
- Ren, S.; Gao, Z.; Hua, T.; Xue, Z.; Tian, Y.; He, S.; Zhao, H. Co-advise: Cross inductive bias distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16773–16782.
- Li, Y.; Bubeck, S.; Eldan, R.; Giorno, A.D.; Gunasekar, S.; Lee, Y.T. Textbooks Are All You Need II: phi-1.5 technical report, 2023, [arXiv:cs.CL/2309.05463].
- Zhao, B.; Wu, B.; Huang, T. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087 2023.
- Li, Y.; Zhang, Y.; Wang, C.; Zhong, Z.; Chen, Y.; Chu, R.; Liu, S.; Jia, J. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models. arXiv preprint arXiv:2403.18814 2024.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. International journal of computer vision 2015, 115, 211–252.
- Xue, S.; Liu, Z.; Chen, F.; Zhang, S.; Hu, T.; Xie, E.; Li, Z. Accelerating Diffusion Sampling with Optimized Time Steps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8292–8301. [CrossRef]
- Yu, C.; Chen, T.; Gan, Z.; Fan, J. Boost vision transformer with gpu-friendly sparsity and quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22658–22668.
- Zhao, W.; Bai, L.; Rao, Y.; Zhou, J.; Lu, J. UniPC: a unified predictor-corrector framework for fast sampling of diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 49842–49869.
- ShareGPT. https://sharegpt.com/, 2023.
- Chen, L.; Zhao, H.; Liu, T.; Bai, S.; Lin, J.; Zhou, C.; Chang, B. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models, 2024, [arXiv:cs.CV/2403.06764].
- Wang, H.; Wang, Y.; Ye, Y.; Nie, Y.; Huang, C. Elysium: Exploring Object-level Perception in Videos via MLLM, 2024, [arXiv:cs.CV/2403.16558].
- Zheng, H.; Nie, W.; Vahdat, A.; Azizzadenesheli, K.; Anandkumar, A. Fast sampling of diffusion models via operator learning. In Proceedings of the International conference on machine learning, 2023, pp. 42390–42402.
- LAION. Gpt-4v dataset. https://huggingface.co/datasets/laion/gpt4v-dataset, 2023.
- Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023. [CrossRef]
- Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, 2024, [arXiv:cs.CL/2404.14219].
- Li, Z.; Sun, M.; Lu, A.; Ma, H.; Yuan, G.; Xie, Y.; Tang, H.; Li, Y.; Leeser, M.; Wang, Z.; et al. Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization. In Proceedings of the 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 109–116.
- Salimans, T.; Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models. In Proceedings of the International Conference on Learning Representations, 2022.
- Hu, Z.; Lan, Y.; Wang, L.; Xu, W.; Lim, E.P.; Lee, R.K.W.; Bing, L.; Poria, S. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. arXiv preprint arXiv:2304.01933 2023.
- Fayyaz, M.; Koohpayegani, S.A.; Jafari, F.R.; Sengupta, S.; Joze, H.R.V.; Sommerlade, E.; Pirsiavash, H.; Gall, J. Adaptive token sampling for efficient vision transformers. In Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 396–414.
- Ye, Q.; Xu, H.; Xu, G.; Ye, J.; Yan, M.; Zhou, Y.; Wang, J.; Hu, A.; Shi, P.; Shi, Y.; et al. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, 2023, [arXiv:cs.CL/2304.14178].
- Li, B.; Wang, R.; Wang, G.; Ge, Y.; Ge, Y.; Shan, Y. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125 2023.
- Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PALM-E: An Embodied Multimodal Language Model. arXiv preprint arXiv:2303.03378 2023.
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 2023.
- Lv, K.; Yang, Y.; Liu, T.; Gao, Q.; Guo, Q.; Qiu, X. Full Parameter Fine-tuning for Large Language Models with Limited Resources. arXiv preprint arXiv:2306.09782 2023.
- Lyu, Z.; Xu, X.; Yang, C.; Lin, D.; Dai, B. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524 2022.
- Chen, L.; Li, J.; Dong, X.; Zhang, P.; He, C.; Wang, J.; Zhao, F.; Lin, D. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 2023.
- Dockhorn, T.; Vahdat, A.; Kreis, K. GENIE: higher-order denoising diffusion solvers. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022, pp. 30150–30166.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).