Backdoor attacks enable adversaries to embed malicious behavior into machine learning models by poisoning training data with triggers. Research to date has focused largely on backdoors in unimodal models. However, the rise of multimodal systems, e.g., vision–language models (VLMs) and multimodal large language models (MLLMs), has significantly expanded the attack surface. Multimodal backdoors can exploit cross-modal triggers, representation-level manipulation, instruction-conditioned behaviors, and test-time activation pathways that are not available in unimodal models. Nevertheless, quantifying progress in this field remains challenging due to fragmented datasets, inconsistent threat models, and the lack of standardized evaluation protocols. This methodological inconsistency limits comparative analysis and impedes a systematic understanding of robustness in multimodal settings. This paper presents a meta-research study of multimodal backdoor attacks and analyzes how methodological fragmentation undermines reproducibility and cumulative scientific understanding. We argue that standardized benchmarks and backward-compatible evaluation protocols are necessary for reliable and systematic advancement of multimodal backdoor research.