Preprint
Article

This version is not peer-reviewed.

Synergistic Multimodal Diffusion Transformer: Unifying and Enhancing Multimodal Generation via Adaptive Discrete Diffusion

Submitted: 28 January 2026

Posted: 29 January 2026


Abstract
Current multimodal artificial intelligence suffers from fragmentation, with models typically optimized for single tasks, impeding efficient and uniform handling of diverse tasks like Text-to-Image (T2I), Image-to-Text (I2T), and Visual Question Answering (VQA) within a single framework. To address this, we propose the Synergistic Multimodal Diffusion Transformer (SyMDit), a novel unified discrete diffusion model. SyMDit integrates an Adaptive Cross-Modal Transformer (ACMT) with a Synergistic Attention Module (SAM) for dynamic interaction, alongside Hierarchical Semantic Visual Tokenization (HSVT) for multi-scale visual understanding and Context-Aware Text Embedding with special tokens for nuanced textual representation. Trained under a unified discrete diffusion paradigm, SyMDit employs a multi-stage strategy, including advanced data augmentation and selective masking. Our extensive evaluations demonstrate that SyMDit consistently achieves superior performance across T2I, I2T, and VQA tasks, outperforming existing baselines. Furthermore, SyMDit significantly enhances inference efficiency, offering substantial speedups compared to autoregressive and prior discrete diffusion methods. This work presents a significant step towards truly unified and efficient multimodal AI, offering a robust framework for general-purpose multimodal intelligence.
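The "selective masking" mentioned in the abstract can be illustrated with a minimal sketch of the forward (corruption) step used in masked discrete diffusion training. The paper's actual masking schedule and token vocabulary are not specified here, so `MASK_ID`, `mask_tokens`, and the fixed masking ratio below are illustrative assumptions, not SyMDit's implementation.

```python
import random

MASK_ID = -1  # hypothetical [MASK] token id; real models reserve a vocabulary slot


def mask_tokens(tokens, mask_ratio, rng=random):
    """One forward step of masked discrete diffusion: independently replace
    each token with MASK_ID at the given ratio, returning the corrupted
    sequence and the positions the model must reconstruct."""
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            corrupted.append(MASK_ID)
            targets.append(i)  # training target: predict tokens[i] at position i
        else:
            corrupted.append(tok)
    return corrupted, targets


# Example: corrupt a short joint (text + image) token sequence at 50% masking.
seq = [5, 12, 7, 3, 9, 21, 14, 8]
noisy, positions = mask_tokens(seq, mask_ratio=0.5, rng=random.Random(0))
```

During training the masking ratio would typically be sampled per step according to a noise schedule, and at inference the model iteratively fills all masked positions in parallel, which is the source of the speedups over token-by-token autoregressive decoding.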
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and the preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

