Huizhou Intangible Cultural Heritage, represented by wood carving, brick carving, stone carving, and paper cutting, features intricate patterns and stylistic diversity. Existing image classification approaches for this material rely heavily on manual curation and therefore lack the scalability and interpretability needed for large-scale digital exhibitions. This study proposes a deep learning framework that adapts large-scale vision–language models to Huizhou Intangible Cultural Heritage through information-maximization self-distillation. Specifically, we employ Contrastive Language-Image Pre-Training (CLIP) as a teacher model to generate pseudo-labels for unlabeled Huizhou heritage images, while a student model is adapted through parameter-efficient tuning of visual prompts and LayerNorm parameters. An entropy-based information-maximization objective further encourages predictions that are confident for each image and diverse across classes. Extensive experiments demonstrate improvements over zero-shot and existing adaptation baselines. Overall, the proposed framework, MM-SHAP, offers a novel, information-theoretic paradigm for unsupervised and efficient fine-tuning in exhibition-focused visual classification tasks, balancing performance and efficiency.
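To make the two training signals concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the common formulation of information maximization (mean per-image prediction entropy minus the entropy of the batch-averaged prediction) and a plain cross-entropy on hard teacher pseudo-labels; the abstract does not give the exact objective or its weighting. The function names `information_maximization_loss` and `pseudo_label_loss` are hypothetical, and the logits stand in for the outputs of a CLIP teacher and a prompt-tuned student.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def information_maximization_loss(student_logits, eps=1e-12):
    """Assumed form: conditional entropy minus marginal entropy.

    Minimizing the first term makes each prediction confident;
    maximizing the second keeps predictions spread across classes.
    """
    probs = softmax(student_logits)
    # Mean entropy of the per-image predictions (lower = more confident).
    conditional_entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean()
    # Entropy of the batch-averaged prediction (higher = more diverse).
    marginal = probs.mean(axis=0)
    marginal_entropy = -(marginal * np.log(marginal + eps)).sum()
    return conditional_entropy - marginal_entropy


def pseudo_label_loss(student_logits, teacher_logits, eps=1e-12):
    """Cross-entropy of the student against hard teacher pseudo-labels."""
    pseudo_labels = teacher_logits.argmax(axis=1)
    probs = softmax(student_logits)
    picked = probs[np.arange(len(pseudo_labels)), pseudo_labels]
    return -np.log(picked + eps).mean()


if __name__ == "__main__":
    # Four images, four hypothetical classes (e.g. the four craft types).
    diverse = 20.0 * np.eye(4)           # confident, one image per class
    collapsed = np.zeros((4, 4))
    collapsed[:, 0] = 20.0               # confident, but all in class 0
    print(information_maximization_loss(diverse))
    print(information_maximization_loss(collapsed))
    print(pseudo_label_loss(diverse, diverse))
```

Under these assumptions, a batch whose predictions are confident and spread across classes scores lower than one that collapses onto a single class, which is the behavior the entropy-based objective is meant to reward.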