Federated learning (FL) has become a burgeoning research area that provides a collaborative training scheme for distributed data sources with privacy concerns. Most existing FL studies take unimodal data, such as images or text, as the model input and focus on resolving the heterogeneity challenge, i.e., the non-identically distributed (non-IID) challenge caused by imbalances in data labels and data quantity across clients. In real-world applications, however, data are usually described by multiple modalities; yet, to the best of our knowledge, only a handful of studies have sought to improve system performance by exploiting multimodal data. In this survey, we highlight the significance of this emerging research topic, multimodal federated learning (MFL), and review state-of-the-art MFL methods. Furthermore, we categorize MFL into congruent and incongruent multimodal federated learning according to whether all clients possess the same combinations of modalities. We also investigate feasible application tasks and related benchmarks for MFL. Lastly, we summarize promising directions and fundamental challenges in this field for future research.