The essence of cross-modal generation and retrieval modeling for image-text interactions lies in semantic consistency, and cross-modal representation alignment is the foundational requirement for achieving it. The crux of our proposed method is the collaborative training of an image captioning model (referred to as the 'captioner') and a pair of contrastive encoders for image-text matching (referred to as the 'concoder'). This synergistic approach yields concurrent improvements in the concoder's retrieval performance, the captioner's generation quality, and cross-modal semantic coherence. We name the proposed method 'ReCap'.

The method proceeds as follows. We first initialize the concoder through knowledge distillation from a pre-trained model. During training, the output of the initialized concoder is fed into a residual attention network, which is jointly trained with the captioner to achieve semantic consistency. We then iteratively update the concoder during the image-text momentum encoding phase using the residual attention network, forming a closed loop that enforces semantic consistency. Once training is complete, the concoder and captioner modules can be used independently for cross-modal retrieval and cross-modal generation of text and images, respectively.

To validate the proposed method, we first conducted experiments on image-text retrieval and image captioning using the widely recognized MSCOCO benchmark. We then assessed the fine-grained cross-modal alignment of the concoder through an image classification task on the iNaturalist 2018 dataset. The achieved performance notably surpasses that of several state-of-the-art models, confirming the effectiveness of our method. Finally, motivated by the practical requirements of nature conservation image data, we annotated the iNaturalist 2018 dataset with captions and trained ReCap on the annotated data. Experimental results demonstrate that the proposed method maintains semantic consistency between cross-modal retrieval and caption generation for species with similar visual features but distinct semantics. This constitutes a meaningful exploration of cross-modal semantic consistency representation and has significant practical value for biological research.
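
To make the training procedure described above more concrete, the following is a minimal PyTorch-style sketch of one joint training step. The module names (Concoder, Captioner, ResidualAttention), the toy feature dimensions, the stub encoders, and the simple additive loss composition are illustrative assumptions rather than the actual ReCap implementation; the distillation-based initialization of the concoder is omitted, and only the EMA-style momentum update and the contrastive-plus-captioning objective are shown.

```python
# Minimal sketch of one joint training step for a captioner and a pair of
# contrastive encoders ("concoder") bridged by a residual attention block.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB = 256, 1000  # toy embedding size and vocabulary size


class Concoder(nn.Module):
    """Stand-in contrastive encoder pair: an image branch and a text branch."""
    def __init__(self):
        super().__init__()
        self.img_enc = nn.Linear(2048, DIM)        # e.g. pooled backbone features
        self.txt_enc = nn.EmbeddingBag(VOCAB, DIM)  # mean-pooled token embeddings

    def forward(self, img_feats, token_ids):
        return self.img_enc(img_feats), self.txt_enc(token_ids)


class ResidualAttention(nn.Module):
    """Residual self-attention block that refines concoder features for the captioner."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(DIM)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


class Captioner(nn.Module):
    """Toy captioner head: predicts caption tokens from the refined visual context."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, visual_ctx, token_ids):
        logits = self.head(visual_ctx).expand(-1, token_ids.size(1), -1)
        return F.cross_entropy(logits.reshape(-1, VOCAB), token_ids.reshape(-1))


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style image-text matching loss."""
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


@torch.no_grad()
def momentum_update(online, momentum, m=0.995):
    """EMA-style update of the momentum concoder from the online branch."""
    for p_o, p_m in zip(online.parameters(), momentum.parameters()):
        p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)


# One joint training step on a dummy batch.
concoder, momentum_concoder = Concoder(), Concoder()
momentum_concoder.load_state_dict(concoder.state_dict())
res_attn, captioner = ResidualAttention(), Captioner()
params = list(concoder.parameters()) + list(res_attn.parameters()) + list(captioner.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

img_feats = torch.randn(8, 2048)              # dummy image features
token_ids = torch.randint(0, VOCAB, (8, 12))  # dummy caption token ids

img_emb, txt_emb = concoder(img_feats, token_ids)
visual_ctx = res_attn(img_emb.unsqueeze(1))   # residual attention over concoder output
loss = contrastive_loss(img_emb, txt_emb) + captioner(visual_ctx, token_ids)

optimizer.zero_grad()
loss.backward()
optimizer.step()
momentum_update(concoder, momentum_concoder)  # momentum-encoding side of the closed loop
```

In this sketch the two objectives are simply summed; how the actual method weights the contrastive and captioning losses, and how the residual attention network feeds the momentum-encoding update, follow the description in the text rather than this toy composition.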