Interpreting interferometric synthetic aperture radar (InSAR) imagery is a critical task in monitoring volcanic and seismic activity, yet the process typically requires expert knowledge and manual analysis. As the volume of satellite observations continues to grow, automated methods capable of describing and interpreting these images become increasingly important for supporting geophysical monitoring efforts. In this work, we investigate the feasibility of automated image captioning for InSAR data using modern vision-language models. We use the Hephaestus dataset, a large collection of annotated interferograms focused on volcanic deformation, and apply a series of preprocessing steps to curate a balanced set of deforming and non-deforming images. Two generative image captioning architectures, the Generative Image-to-Text Transformer (GIT) and Bootstrapping Language-Image Pre-training (BLIP), are fine-tuned to generate natural language descriptions of the InSAR images. In addition, we implement a retrieval-based model that aligns image and text representations in a shared embedding space and retrieves the most semantically similar caption for a given image. The performance of these approaches is evaluated using standard captioning metrics and qualitative inspection of the generated descriptions. Our results suggest that pre-trained vision-language models can adapt to specialized scientific imagery despite being trained primarily on natural images. This study represents an initial step towards automated interpretation systems that assist researchers in large-scale InSAR monitoring applications.
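To make the generative fine-tuning setup concrete, the following is a minimal sketch of one training step for a BLIP-style captioner using the Hugging Face transformers API. The checkpoint name, image path, example caption, and learning rate are illustrative assumptions, not the exact configuration used in this work; in practice the (interferogram, caption) pairs come from the curated Hephaestus-derived dataset.

```python
# Hedged sketch: one BLIP fine-tuning step on an (interferogram, caption) pair.
# Checkpoint, paths, caption text, and hyperparameters are placeholders.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

image = Image.open("interferogram.png").convert("RGB")                     # placeholder path
caption = "No deformation fringes are visible in the interferogram."       # placeholder caption

inputs = processor(images=image, text=caption, return_tensors="pt")

model.train()
outputs = model(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    labels=inputs["input_ids"],   # language-modeling loss over the caption tokens
)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# After fine-tuning, captions are generated autoregressively from the image alone.
model.eval()
generated_ids = model.generate(pixel_values=inputs["pixel_values"], max_new_tokens=40)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Fine-tuning GIT follows the same pattern with its corresponding processor and causal language-modeling head.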
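The retrieval-based approach can be sketched in the same spirit: embed the query interferogram and a pool of candidate captions in a shared space, then return the caption with the highest cosine similarity to the image. The CLIP checkpoint and candidate captions below are illustrative assumptions; the retrieval model described in this work is adapted to the InSAR data rather than used off the shelf.

```python
# Hedged sketch: caption retrieval via a shared image-text embedding space.
# Checkpoint, image path, and candidate captions are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidate_captions = [  # placeholder caption pool
    "Concentric fringes indicate ground deformation near the volcano summit.",
    "The interferogram shows atmospheric artifacts but no deformation signal.",
]
image = Image.open("interferogram.png").convert("RGB")  # placeholder path

with torch.no_grad():
    text_inputs = processor(text=candidate_captions, return_tensors="pt", padding=True)
    image_inputs = processor(images=image, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Normalize and score every candidate caption by cosine similarity to the image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
best = similarity.argmax(dim=-1).item()
print(candidate_captions[best])
```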