Submitted: 31 January 2025
Posted: 04 February 2025
Abstract
Keywords:
I. Introduction
- Model architecture selection
- Hyperparameter impact during training
- Parallelization and Hardware Acceleration with GPUs
- Model Quantization and Compression
II. Background and Related Work
a. LoRA-Based Fine-Tuning
b. Model Quantization
c. Image Upscaling
III. Dataset Used
IV. Methodology
- Image Resolution and Augmentation
- Training Steps, Epochs, and Batch Size
- Learning Rate and Max Gradient Norm
- Warmup Steps and Learning Rate Scheduler
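The hyperparameters named above (training steps, learning rate, max gradient norm, warmup, scheduler) interact in a standard training loop. As a rough sketch only: the tiny model and all values below are illustrative assumptions, not the settings actually used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative values only; the paper's actual hyperparameters differ.
warmup_steps, total_steps = 10, 100
base_lr, max_grad_norm = 1e-4, 1.0

model = nn.Linear(16, 16)  # tiny stand-in for the fine-tuned network
opt = torch.optim.AdamW(model.parameters(), lr=base_lr)

def lr_lambda(step: int) -> float:
    # Linear warmup to base_lr, then linear decay to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

lrs = []
for _ in range(total_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    # "Max gradient norm": clip the global gradient norm before the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    opt.step()
    opt.zero_grad()
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
```

The learning rate climbs linearly to `base_lr` over the warmup steps and then decays, which is the usual reason warmup and scheduler choices are tuned together.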
V. Results
- Generated Image Samples: To assess the effectiveness of the optimized diffusion model, a range of prompts was used to generate representative images. These examples show that the model produces high-quality, aesthetically pleasing results across a variety of challenging prompts. Thanks to the upscaling stage, the samples were output at a resolution of 1024x1024 with preserved image quality. Figure 3 shows sample outputs generated by the pipeline.

- Memory Utilization and Inference Speed: During inference, the team tracked resource utilization. Dynamic quantization significantly reduced the memory footprint and accelerated inference; reducing precision from fp16 to int8 proved to be the right decision for building an efficient pipeline.
- Memory Consumption: When loaded on a Colab T4, the complete fine-tuned fp16 model occupied roughly 6.1 GB of memory throughout inference, while the int8 quantized model required nearly half that, at about 3.5 GB.
- Inference Time: Inference time was measured for three models: the default model, the fine-tuned fp16 model, and the int8 quantized model. Each model was also run with different numbers of inference steps (50, 25, and 10) at a constant guidance scale of 20. The results are compiled in Figure 4.
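The fp16-to-int8 reduction described above can be sketched with PyTorch's post-training dynamic quantization. Since the pipeline code is not included in this excerpt, a small fp32 toy module stands in for a pipeline component (PyTorch's dynamic quantization operates on fp32 weights); the layer sizes and the way memory and latency are measured here are illustrative assumptions, not the paper's setup.

```python
import io
import time

import torch
import torch.nn as nn

# Toy stand-in for one pipeline component; the real UNet is far larger
# and these layer sizes are purely illustrative.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_mb(m: nn.Module) -> float:
    """Approximate a model's memory footprint by its serialized size."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell() / 1e6

def mean_latency_ms(m: nn.Module, x: torch.Tensor, repeats: int = 20) -> float:
    """Average CPU forward-pass latency over several repeats."""
    with torch.no_grad():
        m(x)  # warm-up
        t0 = time.perf_counter()
        for _ in range(repeats):
            m(x)
    return (time.perf_counter() - t0) / repeats * 1000

x = torch.randn(8, 512)
fp32_mb, int8_mb = serialized_mb(model), serialized_mb(quantized)
fp32_ms, int8_ms = mean_latency_ms(model, x), mean_latency_ms(quantized, x)
```

On the toy module the int8 state dict is roughly a quarter of the fp32 one (int8 stores one byte per weight versus four), mirroring at a much smaller scale the 6.1 GB to 3.5 GB drop reported above.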

- Prompts used for the samples in Figure 3: "Realistic crocodile wearing a sweater"; "An alpaca made of colorful building blocks, cyberpunk"; "Realistic astronaut, Behance HD, riding a horse, with a cosmic maelstrom in the back".
- Training Efficiency: The LoRA-based fine-tuning pipeline trained efficiently. The training-loss curve shows that the model reached a stable loss at higher step counts. Training loss was monitored with Weights & Biases and is shown in Figure 5.
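The low-rank adaptation idea behind the fine-tuning pipeline, freezing the base weights and training only a small low-rank update, can be illustrated with a minimal layer. This is a generic sketch, not the paper's implementation, and the rank and scaling values are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A (r x in) and B (out x r)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the LoRA factors are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

base = nn.Linear(64, 64)
lora = LoRALinear(base, r=4)

# Only the low-rank factors are trainable: 2 * r * 64 = 512 parameters
# versus 4160 in the base layer.
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
```

Because B is initialized to zero, the adapted layer starts out exactly equal to the frozen base layer, which is why LoRA fine-tuning begins from the pretrained model's behavior.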

VI. Conclusion and Future Scope
References


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).