As social media platforms have proliferated, social media data has become an important source of spatio-temporal information on urban events. Accurate spatio-temporal information about events strengthens urban management and emergency response. However, existing research on mining the spatio-temporal information of events often focuses solely on textual content and neglects other modalities such as images and videos. This study therefore proposes a spatio-temporal information extraction method for multi-modal social media data (MIST-SMMD), which extracts the spatio-temporal information of events from multi-modal Weibo data at two hierarchical levels, coarse-grained and fine-grained, and serves as a useful supplement to existing urban event monitoring methods. The "July 20th Zhengzhou Heavy Rainfall" incident is used as a case study to evaluate and analyze the effectiveness of the proposed method. The results indicate that, in coarse-grained spatial information extraction using only textual data, our method achieves a Spatial Precision of 87.54% within a 60 m range and reaches 100% Spatial Precision for ranges beyond 200 m. For fine-grained spatial information extraction, introducing other modalities such as images and videos yields a significant improvement in Spatial Error. These results demonstrate the ability of MIST-SMMD to extract the spatio-temporal information of urban events at both coarse and fine levels, and confirm the significant advantages of multi-modal data in enhancing the precision of spatial information extraction.
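To make the two reported metrics concrete, the following is a minimal sketch of how they could be computed. It rests on assumptions that the text above does not state: that Spatial Error is the great-circle distance between an extracted location and its reference location, and that Spatial Precision within a range is the share of extracted locations whose Spatial Error does not exceed that range. The function names (`haversine_m`, `spatial_errors`, `spatial_precision`) are hypothetical and are not part of MIST-SMMD.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def spatial_errors(extracted, reference):
    """Assumed definition: per-event distance (m) between extracted and reference points."""
    if len(extracted) != len(reference):
        raise ValueError("extracted and reference must have the same length")
    return [haversine_m(e[0], e[1], r[0], r[1]) for e, r in zip(extracted, reference)]


def spatial_precision(errors, radius_m):
    """Assumed definition: fraction of events whose spatial error is within radius_m."""
    if not errors:
        raise ValueError("errors must not be empty")
    return sum(err <= radius_m for err in errors) / len(errors)


if __name__ == "__main__":
    # Made-up error values in metres, for illustration only (not results from the study).
    example_errors = [10.0, 50.0, 150.0, 250.0]
    for radius in (60, 200, 300):
        print(radius, spatial_precision(example_errors, radius))
```

Under these assumed definitions, Spatial Precision is non-decreasing in the range, which is consistent with the reported pattern of a lower value at 60 m and 100% beyond 200 m.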