Submitted: 21 June 2025
Posted: 23 June 2025
Abstract
Keywords:
1. Introduction
1.1. Contributions
- A lightweight latent diffusion architecture specifically optimized for resource-constrained training, achieving a 10× reduction in parameter count compared to recent models while maintaining generation quality suitable for research applications.
- Demonstration of effective training on the Indiana University Chest X-ray dataset (3,301 images) using a single RTX 4060 GPU, proving that meaningful research can be conducted with consumer hardware.
- Comprehensive optimization strategies including gradient checkpointing, mixed precision training, and parameter-efficient fine-tuning that enable training within 8GB VRAM constraints.
- Detailed ablation studies showing the impact of various design choices on model performance, providing insights for future resource-efficient medical AI development.
- Public release of all code and model weights with extensive documentation to facilitate reproducible research and enable adoption by resource-constrained research groups.
2. Related Work
2.1. Evolution of Medical Image Synthesis
2.2. Diffusion Models in Medical Imaging
2.3. Text-Conditional Medical Image Generation
2.4. Resource-Efficient Deep Learning
3. Methods
3.1. Dataset Description and Preprocessing
3.1.1. Image Preprocessing
3.1.2. Text Preprocessing
- Section Extraction: Findings and impression sections were extracted using regular expressions.
- Noise Removal: Template phrases, measurement notations, and formatting artifacts were removed.
- Standardization: Medical abbreviations were expanded (e.g., "RLL" → "right lower lobe").
- Length Filtering: Reports exceeding 256 tokens were truncated to fit within model constraints.
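The steps above can be sketched as a minimal pure-Python pipeline. The section pattern, the abbreviation table, and the whitespace tokenizer shown here are illustrative assumptions, not the exact expressions used in our code (which tokenizes with the BioBERT tokenizer).

```python
import re

# Hypothetical abbreviation table; the real mapping is larger.
ABBREVIATIONS = {"RLL": "right lower lobe", "RUL": "right upper lobe"}

def extract_sections(report: str) -> str:
    """Pull the findings and impression sections out of a raw report."""
    pattern = re.compile(r"(FINDINGS|IMPRESSION):\s*(.*?)(?=\n[A-Z]+:|\Z)",
                         re.DOTALL)
    return " ".join(m.group(2).strip() for m in pattern.finditer(report))

def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations with their expansions."""
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b", full, text)
    return text

def truncate_tokens(text: str, max_tokens: int = 256) -> str:
    """Whitespace tokenization as a stand-in for the BioBERT tokenizer."""
    return " ".join(text.split()[:max_tokens])

def preprocess(report: str) -> str:
    return truncate_tokens(expand_abbreviations(extract_sections(report)))
```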
3.1.3. Data Splitting
- Training: 2,311 images (70%)
- Validation: 330 images (10%)
- Test: 660 images (20%)
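A deterministic split with a fixed seed reproduces these counts; the seed value below is an illustrative assumption.

```python
import random

def split_indices(n: int, seed: int = 42):
    """70/10/20 train/val/test split with a fixed seed for reproducibility.

    For n = 3,301 this yields 2,311 / 330 / 660 indices, matching the
    counts reported above.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * 0.70)   # 2,311
    n_val = round(n * 0.10)     # 330
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```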
3.2. Model Architecture

3.2.1. Variational Autoencoder (VAE)
3.2.2. U-Net Denoising Network
- Time Embedding: Sinusoidal positional encoding for diffusion timesteps
- ResNet Blocks: Channel dimensions: 128 → 256 → 512 → 512
- Attention Mechanisms: Self-attention and cross-attention at 8×8, 16×16, and 32×32 resolutions
- Skip Connections: Preserving fine-grained information
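The time embedding follows the standard DDPM-style sinusoidal formulation; the embedding dimension and frequency base below are assumptions for illustration.

```python
import math

def timestep_embedding(t: int, dim: int = 128, base: float = 10000.0):
    """Sinusoidal positional encoding for a diffusion timestep t.

    Returns a dim-dimensional vector: the first half holds sines, the
    second half cosines, at geometrically spaced frequencies.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(base) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```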
3.2.3. BioBERT Text Encoder
- A projection layer (768→512 dimensions)
- Layer normalization parameters
- The final pooling layer
3.3. Training Procedure
3.3.1. Stage 1: VAE Training
- 200 epochs with AdamW optimizer
- Learning rate: cosine annealing schedule
- Batch size: 32
- Mixed precision: FP16
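Cosine annealing decays the learning rate smoothly over training. A sketch of the schedule; `base_lr` is a placeholder, since the exact value is not restated in this section.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0):
    """Cosine annealing: decays base_lr to min_lr over total_steps.

    lr(t) = min_lr + (base_lr - min_lr) * 0.5 * (1 + cos(pi * t / T))
    """
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos
```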
3.3.2. Stage 2: Diffusion Model Training
- 480 epochs with frozen VAE weights
- Learning rate: decaying schedule
- Batch size: 4 with 4-step gradient accumulation
- 10% null conditioning for classifier-free guidance
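The null-conditioning mechanism can be sketched in two parts: during training, the text embedding is replaced with a learned null embedding 10% of the time; at sampling time, the conditional and unconditional noise predictions are combined with a guidance scale. The guidance-scale value is a tunable hyperparameter, not stated here.

```python
import random

def maybe_drop_condition(text_emb, null_emb, p_drop=0.10, rng=random):
    """Training-time condition dropout for classifier-free guidance:
    with probability p_drop, use the null embedding instead of the text."""
    return null_emb if rng.random() < p_drop else text_emb

def guided_noise(eps_cond, eps_uncond, scale):
    """Sampling-time combination, applied elementwise:
    eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```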
3.4. Memory Optimization Strategies
- Gradient Checkpointing: Recomputing activations during backpropagation reduced memory by 70% at 20% time cost.
- Mixed Precision Training: FP16 computation reduced memory by 50%.
- Gradient Accumulation: Enabled effective batch size of 16.
- Efficient Attention: Chunked computation reduced peak memory usage.
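Of these, gradient accumulation is the simplest to sketch without framework specifics: gradients from four micro-batches of 4 are scaled and summed before a single optimizer step, which is equivalent to one step on an effective batch of 16. A framework-agnostic sketch (gradients as plain lists; a real framework keeps them on the parameters):

```python
def accumulate_gradients(micro_batch_grads, accum_steps=4):
    """Average per-micro-batch gradients before one optimizer step.

    With micro-batch size 4 and accum_steps = 4, the resulting update
    matches a single step on an effective batch of 16.
    """
    acc = [0.0] * len(micro_batch_grads[0])
    for grads in micro_batch_grads:          # one entry per micro-batch
        for i, g in enumerate(grads):
            acc[i] += g / accum_steps        # scale so the sum is a mean
    return acc  # apply the optimizer step with this averaged gradient
```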
3.5. Inference Pipeline
4. Experimental Results
4.1. Training Dynamics and Convergence Analysis
4.1.1. VAE Training Convergence
| Epoch | Total Loss | Reconstruction Loss | KL Divergence | SSIM |
|---|---|---|---|---|
| 1 | 0.5329 | 0.5328 | 0.77 | 0.412 |
| 10 | 0.0035 | 0.0032 | 3.52 | 0.823 |
| 50 | 0.0012 | 0.0009 | 2.64 | 0.887 |
| 67 | 0.0010 | 0.0008 | 2.57 | 0.891 |
| 100 | 0.0011 | 0.0008 | 2.71 | 0.889 |
4.1.2. Diffusion Model Training Dynamics
- Rapid Initial Learning (Epochs 1-100): Validation loss decreased from 0.198 to 0.0423
- Gradual Refinement (Epochs 100-350): Slow improvement to 0.0245
- Fine-tuning (Epochs 350-480): Best loss of 0.0221 at epoch 387
| Epoch | Train Loss | Val Loss | Learning Rate | FID Score |
|---|---|---|---|---|
| 50 | 0.0512 | 0.0534 | - | 145.3 |
| 100 | 0.0398 | 0.0423 | - | 98.7 |
| 200 | 0.0289 | 0.0312 | - | 76.2 |
| 387 | 0.0266 | 0.0221 | - | 52.1 |
| 480 | 0.0264 | 0.0360 | - | 54.3 |
4.2. Generation Quality Assessment
4.2.1. Quantitative Metrics
| Metric | Value | std | Description |
|---|---|---|---|
| SSIM | 0.82 | ±0.08 | Structural similarity |
| PSNR | 22.3 dB | ±2.1 | Peak signal-to-noise ratio |
| FID | 52.1 | - | Fréchet Inception Distance |
| IS | 3.84 | ±0.21 | Inception Score |
| LPIPS | 0.234 | ±0.045 | Perceptual similarity |
| MSE | 0.0079 | ±0.0023 | Mean squared error |
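PSNR and MSE are linked by PSNR = 10·log₁₀(MAX²/MSE). As a sanity check on the table, the reported mean MSE of 0.0079 (for images scaled to [0, 1]) corresponds to roughly 21.0 dB; the reported 22.3 dB mean is consistent with this, since PSNR averaged per image generally differs from the PSNR of the mean MSE.

```python
import math

def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    return 10.0 * math.log10(max_val ** 2 / mse)
```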
4.2.2. Text-Image Alignment Evaluation
| Finding | Precision | Recall | F1-Score |
|---|---|---|---|
| Normal | 0.89 | 0.92 | 0.90 |
| Pneumonia | 0.76 | 0.71 | 0.73 |
| Effusion | 0.81 | 0.78 | 0.79 |
| Cardiomegaly | 0.84 | 0.86 | 0.85 |
| Pneumothorax | 0.72 | 0.68 | 0.70 |
| Overall | 0.80 | 0.79 | 0.79 |
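Each F1-score in the table is the harmonic mean of the corresponding precision and recall, which can be verified directly:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)
```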
4.3. Comparative Analysis
| Model | Parameters | Dataset Size | GPUs | Training Time | FID |
|---|---|---|---|---|---|
| Our Model | 148.84M | 3,301 | 1× RTX 4060 | 96h | 52.1 |
| RoentGen | >1B | 377,110 | 8× A100 | 552h | 41.2* |
| Cheff | >500M | 101,205 | 4× V100 | 384h | 38.7* |
* FID as reported in the original publications (computed on different datasets, so not directly comparable).
4.4. Ablation Studies
4.4.1. Architecture Components
- 8-channel VAE latent space is crucial for quality
- BioBERT significantly outperforms general BERT
- Text conditioning improves anatomical accuracy
| Configuration | Val Loss | FID | Parameters | Memory |
|---|---|---|---|---|
| Full Model | 0.0221 | 52.1 | 148.84M | 7.2 GB |
| 4-channel VAE | 0.0341 | 78.3 | 147.21M | 6.8 GB |
| No attention in VAE | 0.0267 | 59.2 | 142.13M | 6.5 GB |
| Smaller U-Net (50%) | 0.0289 | 64.7 | 128.99M | 5.9 GB |
| No text conditioning | 0.0198 | 71.2 | 40.91M | 4.3 GB |
| General BERT (instead of BioBERT) | 0.0276 | 61.4 | 148.84M | 7.2 GB |
4.4.2. Training Strategies
| Strategy | Val Loss | Training Time | Peak Memory |
|---|---|---|---|
| Baseline (FP32, no opt.) | OOM | - | >16 GB |
| + Mixed Precision | 0.0234 | 142h | 11.3 GB |
| + Gradient Checkpoint | 0.0227 | 168h | 8.7 GB |
| + Gradient Accumulation | 0.0221 | 156h | 7.2 GB |
| + All optimizations | 0.0221 | 186h | 7.2 GB |
4.5. Qualitative Analysis
- Normal anatomy: Clear lung fields, normal cardiac silhouette
- Pneumonia: Focal consolidations in appropriate locations
- Cardiomegaly: Enlarged cardiac silhouette
- Pleural effusion: Blunting of costophrenic angles
- Pneumothorax: Absence of lung markings

4.5.1. Failure Case Analysis
| Failure Type | Frequency | Example | Potential Cause |
|---|---|---|---|
| Anatomical implausibility | 8.2% | Ribs crossing midline | Limited training data |
| Wrong laterality | 5.1% | Right pathology on left | Text encoding ambiguity |
| Missing subtle findings | 12.3% | Small nodules | Resolution limitations |
| Unrealistic textures | 3.7% | Pixelated lung fields | VAE compression |
4.6. Computational Efficiency
| Metric | Our Model | Typical Requirements | Efficiency Gain |
|---|---|---|---|
| Training GPUs | 1× RTX 4060 | 8× A100 | 8× |
| GPU Memory | 8 GB | 80 GB | 10× |
| Training Time | 96 hours | 500+ hours | 5.2× |
| Inference Memory | 5.2 GB | 16+ GB | 3.1× |
| Model Storage | 423 MB | 4+ GB | 9.5× |
5. Discussion
5.1. Technical Insights
5.1.1. Latent Space Dimensionality
5.1.2. Parameter-Efficient Fine-tuning
5.1.3. Optimization Synergies
5.2. Clinical Relevance and Applications
- Medical Education: Real-time generation for teaching specific pathologies
- Data Augmentation: Synthetic examples of rare conditions (with validation)
- Privacy-Preserving Research: Sharing models instead of patient data
5.3. Limitations
- Resolution: 256×256 pixels insufficient for subtle clinical findings
- Dataset Size: 3,301 images from single institution limits generalization
- Clinical Validation: No formal radiologist evaluation conducted
- Temporal Information: Single static images without disease progression
5.4. Ethical Considerations
- Misuse Prevention: Watermarking and access controls needed
- Bias Awareness: Limited dataset diversity may affect generation quality for underrepresented populations
- Clinical Safety: Not suitable for diagnostic use without extensive validation
5.5. Future Work
- Architectural: Progressive generation for higher resolutions
- Training: Federated learning across institutions
- Clinical: Radiologist validation and clinical metrics development
6. Conclusion
- 148.84M parameter model trainable in 8GB VRAM
- 96-hour training on 3,301 images using RTX 4060
- 663ms inference time per image
- Complete code release at https://github.com/priyam-choksi/cxr-diffusion
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kazerouni A, Aghdam EK, Heidari M, et al. Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis. 2023;88:102846.
- Khader F, Müller-Franzes G, Arasteh ST, et al. Medical diffusion–denoising diffusion probabilistic models for 3D medical image generation. Scientific Reports. 2023;13(1):7303.
- Bluethgen C, Chambon P, Delbrouck JB, et al. RoentGen: Vision-Language Foundation Model for Chest X-ray Generation. arXiv preprint arXiv:2211.12737. 2022.
- Chambon P, Bluethgen C, Langlotz CP, Chaudhari A. Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains. arXiv preprint arXiv:2210.04133. 2022.
- Litjens G, Kooi T, Bejnordi BE, et al. A survey on deep learning in medical image analysis. Medical Image Analysis. 2017;42:60-88.
- Yi X, Walia E, Babyn P. Generative adversarial network in medical imaging: A review. Medical Image Analysis. 2019;58:101552.
- Nie D, Trullo R, Lian J, et al. Medical image synthesis with context-aware generative adversarial networks. In: MICCAI 2017. Springer; 2017:417-425.
- Costa P, Galdran A, Meyer MI, et al. End-to-end adversarial retinal image synthesis. IEEE Transactions on Medical Imaging. 2017;37(3):781-791.
- Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: ICML 2017. PMLR; 2017:214-223.
- Kingma DP, Welling M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 2013.
- Chen X, Konukoglu E. Unsupervised detection of lesions in brain MRI using constrained adversarial auto-encoders. arXiv preprint arXiv:1806.04972. 2018.
- Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. NeurIPS. 2020;33:6840-6851.
- Song Y, Sohl-Dickstein J, Kingma DP, et al. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. 2020.
- Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. NeurIPS. 2021;34:8780-8794.
- Wolleb J, Sandkühler R, Bieder F, et al. Diffusion models for implicit image segmentation ensembles. In: MIDL 2022. PMLR; 2022:1336-1348.
- Pinaya WH, Tudosiu PD, Dafflon J, et al. Brain imaging generation with latent diffusion models. In: MICCAI Workshop on Deep Generative Models. Springer; 2022:117-126.
- Müller-Franzes G, Niehues JM, Khader F, et al. Diffusion probabilistic models beat GANs on medical image synthesis. Scientific Reports. 2023;13(1):13788.
- Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models. In: CVPR 2022; 2022:10684-10695.
- Saharia C, Chan W, Chang H, et al. Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022; 2022:1-10.
- Zhang Z, Yang L, Zheng Y. Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network. In: CVPR 2018; 2018:9242-9251.
- u J, Trevisan Jost V. Text2Brain: Synthesis of brain activation maps from free-form text queries. In: MICCAI 2023. Springer; 2023:605-614.
- Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240.
- Hu EJ, Shen Y, Wallis P, et al. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. 2021.
- Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314. 2023.
- Salimans T, Ho J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512. 2022.
- Dao T, Fu D, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. NeurIPS. 2022;35:16344-16359.
- Nichol AQ, Dhariwal P. Improved denoising diffusion probabilistic models. In: ICML 2021. PMLR; 2021:8162-8171.
- Demner-Fushman D, Kohli MD, Rosenman MB, et al. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association. 2016;23(2):304-310.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).